Several leading market research firms (e.g., Kantar, IRI, Nielsen) collect so-called consumer panel data and sell reports and analyses based on these data to interested parties. A large number of households are recruited to join the panel, and they record all their grocery/HBA purchases. The idea of a panel is that we have repeated observations of the same people. The ability to track what individual households are purchasing over time can give important insights into what behavioural changes lie beneath observed changes in aggregate sales data. (The terms panellist and household are used interchangeably.) Nowadays, it is common for panellists to record their purchasing by scanning product barcodes via an app on their smartphones.
The objective of this notebook is to provide an introduction to basic analyses we can undertake using panel data. After briefly describing the data we will be working with, we present some preliminary aggregate-level analyses. Next we introduce some basic brand performance measures and consider the simple analyses that describe the variation we observe in buyer behaviour in a given time period. This analysis focuses on one brand at a time; next we consider some basic analyses that describe consumers’ buying of multiple brands in a category. We finish with some basic analyses that describe how buyer behaviour evolves over time, both for established products and new products.
Note: Whereas consumer panel data gives us information on the purchasing of a sample of customers for the whole category, the data in a firm’s transaction database gives us complete information on the purchasing of our products (but not those of our competitors). At a fundamental level, the types of reports developed by market research firms over the past 60+ years are a good starting point for the types of reports a firm should create as it seeks to understand the buying behaviour of its customers.
A traditional consumer panel works in the following manner:
When an individual first joins the panel, they complete a detailed questionnaire. A section of this questionnaire focuses on the demographics of their household. (This is typically updated once a year.)
After each shopping trip, each panellist records their purchases, scanning the barcode associated with each product and recording other information such as where the purchase was made (store or channel), the price paid, and the use of promotional deals. Twenty years ago, this would have been done using a custom handheld barcode scanner provided by the market research firm. These days, it is more common to use a smartphone app.
These data are uploaded to the research firm’s servers and merged with the purchases records of the other panellists. Each barcode is matched with detailed product information (e.g., category, brand, size, flavour) and this information is also stored in the database. The analyst can then create (typically product-category-specific) datasets for further analysis that tell us what each panellist purchased, when and where, and associated transaction- and/or product-specific data that may be of interest.
Panellists drop out of the panel all the time, and the research firm will be recruiting replacement households on a regular basis. When creating a dataset for further analysis, it is generally desirable to work with a so-called static sample of panellists, which comprises all those panellists active during the time period of interest; new panellists, as well as those that dropped out during the given time period, are excluded.
2 Imports
2.1 Import Packages
Code
import polars as plimport altair as altimport numpy as npfrom great_tables import GT, style, loc, mdimport gc# https://altair-viz.github.io/user_guide/display_frontends.html# alt.renderers.enable('mimetype')# alt.JupyterChart.enable_offline()# alt.renderers.enable("jupyter", offline=True)
2.2 Import Panel Data
We will make use of two datasets as we explore the basic types of summaries of buyer behaviour that can be created using consumer panel data. The first contains data on the purchasing of established brands in a mature product category, while the second contains data on the purchasing of a new product. Both datasets were created using static samples. While these are small datasets and contain a subset of the information available in the research firm’s databases, they are more than sufficient to convey the logic of creating the key summaries of buyer behaviour. Neither dataset includes data on the demographics of each panellist. As such, we will not consider how to create reports that explore how behaviours differ across demographic groups (e.g., by age or geography). However, anyone comfortable with the analyses undertaken in this note should be able to work out how to create such reports for themselves.
2.2.1 Dataset 1
The file edible_grocery.csv contains two years of data from a panel of 5021 households on their purchasing in an unnamed edible grocery product category. (We intentionally do not identify the category and the associated brand names.) There are 119 SKUs in this category. 91 SKUs are associated with the four largest brands in the category, which we have named Alpha, Bravo, Charlie, and Delta. The remaining SKUs belong to very low-share brands and we grouped them under the brand Other.
Note: SKU - A stock-keeping unit (SKU) is a unique combination of the attributes (e.g., brand, package type, package size, flavour) that define the products in the category.
Each record in this file consists of seven fields:
panel_id A unique identifier for each household.
trans_id A unique identifier for the purchase occasion.
week The week number in which the purchase occurred. Week 1 corresponds to the calendar week starting on January 1, 20xy. Week 53 corresponds to the calendar week starting on December 31, 20xy.
sku_id The SKU code.
units The number of units purchased on the particular purchase occasion.
price The price per unit paid at the point of purchase.
brand The brand associated with the SKU purchased.
The associated file sku_weight.csv gives us the weight (in grams) of each SKU. There are two fields: sku_id and weight.
“Kiwi Bubbles” is a masked name for a shelf-stable juice drink, aimed primarily at children, which is sold as a multipack with several single-serve containers bundled together. Prior to national launch, it underwent a year-long test conducted in two of IRI’s BehaviorScan markets. The file kiwibubbles_trans.csv contains purchasing data for the new product, drawn from 1300 panellists in Market 1 and 1499 panellists in Market 2. (The purchasing of other brands in the category has been excluded from the dataset.)
Each record in this file consists of five fields:
ID A unique identifier for each household.
Market The test market in which the household is located (1 or 2).
Week The week in which the purchase occurred.
Day The day of the week in which the purchase occurred. (The product was launched on day 1 of week 1.)
Units The number of units of the new product purchased on the particular purchase occasion.
# Altair Weekly Line Plot
def weekly_plot(dataframe, y, year=2, color=None, title="", y_axis_label="", pct=False, legend=False):
    # Configure the color encoding only if color is provided
    if color is not None:
        color_encoding = alt.Color(
            f'{color}:N',  # N = a discrete unordered category
            legend=alt.Legend(title=color) if legend else None  # Add legend conditionally
        )
    else:
        color_encoding = alt.Color()  # No color encoding

    chart = alt.Chart(dataframe).mark_line(strokeWidth=1).encode(
        x=alt.X(
            'week',
            axis=alt.Axis(
                values=np.arange(0, (year * 52) + 1, 13),  # Explicitly specify quarter-end weeks
                labelExpr="datum.value",  # Show only these labels
                title='Week'
            )
        ),
        y=alt.Y(
            f'{y}:Q',  # Q = a continuous real-valued quantity
            title=y_axis_label,
            axis=alt.Axis(format="$,.0f") if not pct else alt.Axis(format=",.0%")
        ),
        color=color_encoding
    ).properties(
        width=650, height=250, title=title
    ).configure_view(
        stroke=None
    ).configure_axisY(
        # grid=False  # turn off y-axis grid if required
    )
    return chart  # alt.JupyterChart(chart)
def freq_dist_plot(
    data, column, bin_edges, labels, x_title, y_title, chart_title, subtitle,
    width=650, height=250, label_angle=0, left_closed=True, compute_rel_freq=True
):
    """
    Creates a standardized Altair bar chart for relative frequency distribution plots.

    Parameters:
    - data (Polars LazyFrame or DataFrame): Input dataset.
    - column (str): Column to analyze for distribution.
    - bin_edges (array-like): Edges for binning.
    - labels (list of str): Labels for the bins.
    - x_title (str): Title for the x-axis.
    - y_title (str): Title for the y-axis.
    - chart_title (str): Main title for the chart.
    - subtitle (str): Subtitle for the chart.
    - width (int, optional): Width of the chart. Default is 650.
    - height (int, optional): Height of the chart. Default is 250.
    - label_angle (int, optional): Angle for x-axis labels. Default is 0.
    - left_closed (bool, optional): Whether bins are left-closed. Default is True.
    - compute_rel_freq (bool, optional): Whether to compute relative frequencies. Default is True.

    Returns:
    - alt.Chart: The generated Altair chart.
    """
    # Apply binning to the data
    binned_data = data.with_columns(
        pl.col(column).cut(bin_edges, labels=labels, left_closed=left_closed).alias("cut")
    )

    # Optionally compute relative frequencies
    if compute_rel_freq:
        binned_data = (
            binned_data
            .group_by("cut")
            .agg(pl.col("cut").count().alias("Frequency"))
            .with_columns(
                (pl.col("Frequency") / pl.col("Frequency").sum()).alias("% of Total")
            )
            .collect()
        )

    # Create the Altair chart
    chart = alt.Chart(binned_data).mark_bar().encode(
        x=alt.X("cut:O", axis=alt.Axis(labelAngle=label_angle, title=x_title), sort=labels),
        y=alt.Y("% of Total:Q", axis=alt.Axis(format=".0%", title=y_title)),
    ).properties(
        width=width,
        height=height,
        title={"text": chart_title, "subtitle": subtitle},
    )
    return chart
3 Preliminaries
Before we start analysing household-level behaviour, let us first get a sense of the general sales patterns observed in this category.
Our initial objective is to plot weekly revenue for all the brands and the overall category, respectively, and then plot weekly (volume) market shares for target brands.
with pl.StringCache():
    pct_volume_plot = (
        weekly_vol_summary('All', grocery_lf)
        .group_by('week')
        .agg(
            pl.col('Weekly Volume').sum().alias('Total Weekly Volume')
        )
        .join(
            other=weekly_vol_summary(['Alpha', 'Bravo'], grocery_lf),
            on='week',
        )
        .filter(
            pl.col('brand').is_in(['Alpha', 'Bravo'])
        )
        .with_columns(
            # compute brand-wise % of total volume sales
            (pl.col('Weekly Volume') / pl.col('Total Weekly Volume')).alias('pct_volume')
        )
    ).collect()

weekly_plot(
    dataframe=pct_volume_plot,
    y='pct_volume',
    color='brand',
    title="Volume Market Share - Alpha vs. Bravo",
    y_axis_label="% of Weekly Total",
    pct=True,
    legend=True
)
There appears to be a high level of competition between these two brands. What is the correlation in their market shares?
We observe that there is a strong negative correlation between the shares of Alpha and Bravo: an increase in one brand’s share is associated with a corresponding decrease in the share of the other brand.
base = alt.Chart(annual_change_share).encode(
    x=alt.X("brand:N", title='Brands', axis=alt.Axis(labelAngle=0)),
    y=alt.Y("% Change:Q", axis=alt.Axis(format=".0%")),
    color=alt.condition(
        alt.datum["% Change"] > 0,
        alt.value("green"),  # The positive color
        alt.value("red")     # The negative color
    ),
    text=alt.Text('% Change', format=".1%")
).properties(
    width=650, height=250, title='% Change in Annual Market Share'
)

chart = base.mark_bar() + base.mark_text(
    align='center',
    baseline=alt.expr(alt.expr.if_(alt.datum['% Change'] >= 0, 'bottom', 'top')),
    dy=alt.expr(alt.expr.if_(alt.datum['% Change'] >= 0, -3, 3)),
    dx=0
)

chart.show()
3.5 Average SKU Price
Code
avg_sku_price = grocery_lf.select(
    ['week', 'sku_id', 'price', 'brand']
).filter(
    (pl.col('week') <= 52) &         # pricing in the first year
    (pl.col('brand') == 'Alpha')     # Filter by Alpha brand
).group_by('brand', 'sku_id').agg(
    pl.col('price').mean()
).drop('brand').sort(
    pl.col('sku_id').cast(pl.Int8)
)

(
    GT(avg_sku_price.collect(), rowname_col='sku_id')
    .tab_header(title='Alpha - Average SKU Retail Prices')
    .tab_stubhead(label="SKU IDs")
    .fmt_currency()
    .data_color(
        columns=['price'],
        domain=[1, 15],
        palette=['white', 'rebeccapurple']
    )
    .cols_label(price='Average Price ($)')
)
Alpha - Average SKU Retail Prices

SKU IDs    Average Price ($)
4          $1.85
5          $3.44
6          $2.61
7          $5.10
8          $12.43
9          $6.59
10         $10.49
11         $2.27
12         $2.79
14         $1.99
15         $3.42
16         $2.58
17         $2.80
18         $3.27
19         $2.62
20         $3.43
21         $2.99
22         $2.09
Code
avg_sku_price.drop(pl.col('sku_id')).describe()
shape: (9, 2)

statistic     price
count         18.0
null_count    0.0
mean          4.042794
std           2.947076
min           1.84705
25%           2.582056
50%           2.99
75%           3.438095
max           12.43
Garbage Collect
Code
exceptions = ['grocery_lf', 'sku_lf', 'kiwi_lf', 'In', 'exceptions', 'active_variables']

active_variables = [
    var for var, value in globals().items()
    if not var.startswith('_')   # Exclude variables that start with "_"
    and var not in exceptions    # Exclude variables in the exceptions list
    and isinstance(value, (pl.LazyFrame, pl.DataFrame, pl.Series, alt.Chart,
                           alt.LayerChart, list, int, float, str,
                           np.ndarray, np.int64, np.float32))  # Remove these types only
]

for var in active_variables:
    del globals()[var]

del active_variables, exceptions, var
gc.collect()
2955
4 Exploring Variation in Buyer Behaviour
4.1 Introduction
The objective here is to explore how to perform the basic analyses that describe the variation we observe in buyer behaviour in a given time period. We will continue to work with the edible grocery dataset, exploring both purchase frequency and spend.
Before we can do any analysis, we need to create some summary datasets. The first will summarise how many times each panellist purchased each brand as well as in the category. The second will summarise how much each panellist spent on each brand and in the category.
On this one shopping trip (trans_id = 844), panellist 3102016 purchased a total of six items: three packs of SKU 5, two packs of SKU 15, and one pack of SKU 89. They purchased two different brands, Alpha and Delta.
Why are there three lines for SKU 5 and not one line with units = 3? This is simply a function of how the items were scanned at the checkout. Some checkout operators will scan the three items separately; this results in three lines in the transaction file, each with units = 1. Others will scan the item once and press “3” on their till, resulting in one line in the transaction file with units = 3.
By convention, this purchase occasion is recorded as one category transaction, one Alpha transaction, and one Delta transaction. The number of units of Alpha purchased is five. When we say that the panellist made one category transaction, we mean they purchased at least one item in the category on that shopping trip. When we say that the panellist made one Alpha transaction, we mean they purchased at least one item associated with the brand on that shopping trip.
In order to analyse buyer behaviour in terms of transactions, we need to know the number of brand and category transactions for each person. We cannot work directly with the dataset we have been using; we effectively need to collapse the five rows associated with Alpha into one. This will require some intermediate analysis, which we undertake in the following manner.
The intermediate prep should produce a dataset that summarises transaction 844 by panellist 3102016 as one occasion of an Alpha brand purchase and one occasion of a Delta brand purchase:
4.2 Panellist-Level Data Preparation
One row per transaction, with a binary indicator of whether or not each brand was purchased.
Filter for year == 1, group by transaction ID, panel ID and brand, then count unique panel_ids per transaction and brand
Each panellist has made some number of transactions. Group unique transactions and obtain the count of transaction occasions in which each listed brand was purchased.
Note: this is not a sum of total units (by brand) purchased per panellist.
Note: Because there are multiple entries for the same transaction ID, we do not simply count brand instances per panellist (without grouping by trans_id), as this would double-count some occasions.
Produce a dataset where there is just one row per transaction, with a binary indicator of whether or not each brand was purchased.
Create a new column to track category purchase. Each row is an instance of at least one brand purchase so category column is set to 1.
Panellist-level transaction summary
The function below executes a query plan which summarises each panellist’s purchase occasions (transactions) for each brand.
The function below executes a query plan which summarises each panellist’s volume purchasing in year 1.
Code
def vol_summary(brand, lf, year):
    with pl.StringCache():
        group_vol = lf.filter(
            (pl.col('week') <= (year * 52)) & (pl.col('week') > ((year - 1) * 52))
        ).join(
            other=sku_lf, left_on='sku_id', right_on='sku_id'
        ).drop(
            pl.col('week', 'sku_id')
        ).with_columns(
            # volume is the product of the weight of each SKU and the units
            # sold, converted from grams to kilograms
            ((pl.col('units') * pl.col('weight')) / 1000).alias('volume')
        ).drop(
            pl.col('units', 'price', 'weight')
        )

        if brand == "Category":
            # Panellist-level category volume sales summary
            summary = group_vol.drop(
                pl.col('brand')
            ).group_by(
                'panel_id'
            ).agg(
                pl.col('volume').sum()
            )
        else:
            # Panellist-level brand volume sales summary
            summary = group_vol.group_by(
                'panel_id', 'brand'
            ).agg(
                pl.col('volume').sum()
            ).filter(
                pl.col('brand') == brand
            ).drop('brand')

    return summary
Code
def vol_pivot(lf, year):
    with pl.StringCache():
        group_vol = lf.filter(
            (pl.col('week') <= (year * 52)) & (pl.col('week') > ((year - 1) * 52))
        ).join(
            other=sku_lf, left_on='sku_id', right_on='sku_id'
        ).drop(
            pl.col('week', 'sku_id')
        ).with_columns(
            # volume is the product of the weight of each SKU and the units
            # sold, converted from grams to kilograms
            ((pl.col('units') * pl.col('weight')) / 1000).alias('volume')
        ).drop(
            pl.col('units', 'price', 'weight')
        ).collect().pivot(
            on='brand',
            index='panel_id',
            values='volume',
            aggregate_function='sum'
        ).select(
            'Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'
        )
    return group_vol
Looking at panellist-level spend and panellist-level volume purchasing, we note that for each row, Category equals the sum of the brand numbers, as we would expect. But this is not always the case in panellist-level transactions. Why is the sum of the brand-specific numbers sometimes greater than the associated category number?
Answer: Because a single transaction (once grouped) can contain occasions of multiple brand purchases, but the transaction counts as only one occasion of a category purchase. Consider panellist 3102016 and their transaction 844: the four entries under this one transaction ID count as a single category transaction, a single Alpha transaction, and a single Delta transaction. The category transaction count is not the sum of the brand transaction counts because, regardless of the brand(s) and quantities purchased, the panellist purchased in the category only once on that transaction.
4.3 Examining Purchase Frequency
Two standard brand performance metrics that summarize purchasing behaviour are penetration and purchases per buyer (PPB).
Penetration is the percentage of households buying the product/category at least once in the given time period.
In order to compute this, we need to know the number of panellists buying the product at least once in the time period of interest and the size of the panel during this period.
Purchases per buyer (PPB) is the average number of times (separate shopping trips) the product/category was purchased (in the given time period) by those households that made at least one product/category purchase (in the given time period).
i.e. purchases per buyer (PPB) is the average number of times the product was purchased (in the given time period) per buyer.
This is computed as the total number of purchase occasions on which the product was purchased by the panellists (in the time period of interest) divided by the number of panellists that purchased the product at least once (in the time period of interest).
Looking at panellist-level transaction summary, we see that there are 4574 rows in this table, meaning that we have summary data on the purchasing of 4574 households in year 1. But there are 5021 households in the panel. What has happened to the remaining 447 households? They did not make any category purchase during the year. (But they will have purchased in other categories.)
Code
# the number of panellists who purchased each brand at least once in year 1
buyers = trans_pivot(grocery_lf, 1).count().select(
    'Alpha', 'Bravo', 'Charlie', 'Delta', 'Other', 'Category'
).unpivot(
    variable_name='brand', value_name='buyers'
)

(
    GT(buyers, rowname_col='brand')
    .tab_header(title='Number of Buyers in Year 1')
    .tab_stubhead(label="Brands/Category")
    .fmt_integer()
    .data_color(
        columns=['buyers'],
        rows=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
        domain=[100, 3_000],
        palette=['white', 'rebeccapurple']
    )
    .cols_label(buyers='# Of Buyers')
)
Number of Buyers in Year 1

Brands/Category    # Of Buyers
Alpha              2,624
Bravo              2,562
Charlie            813
Delta              380
Other              176
Category           4,574
Code
# On how many purchase occasions did the buyers of each brand/category buy that brand/category?
# i.e., on how many purchase occasions did the 2624 buyers of Alpha buy Alpha?
transactions = trans_pivot(grocery_lf, 1).sum().select(
    'Alpha', 'Bravo', 'Charlie', 'Delta', 'Other', 'Category'
).unpivot(
    variable_name='brand', value_name='transactions'
)

(
    GT(transactions, rowname_col='brand')
    .tab_header(title='Purchase Occasions by Buyers')
    .tab_stubhead(label="Brands/Category")
    .fmt_integer()
    .data_color(
        columns=['transactions'],
        rows=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
        domain=[400, 10_000],
        palette=['white', 'rebeccapurple']
    )
    .cols_label(transactions='# Of Transactions')
)
Purchase Occasions by Buyers

Brands/Category    # Of Transactions
Alpha              9,060
Bravo              8,255
Charlie            1,882
Delta              859
Other              422
Category           20,030
Code
panel_size = 5021

# Penetration is simply the number of brand buyers divided by the number of
# panellists, expressed as a percentage
penetration = buyers.with_columns(
    (pl.col('buyers') / panel_size).alias('penetration')
)

# Purchases per buyer (PPB) is simply the total number of purchase occasions
# divided by the number of buyers
ppb = transactions.with_columns(
    (pl.col('transactions') / penetration.get_column('buyers')).alias('ppb')
)
We see that 91% of the households in the panel purchased in the category at least once in year 1. (This is a widely purchased product category.) On average, they purchased in the category 4.4 times that year. Looking at Alpha, we see that 52% of the households in the panel purchased the brand at least once, purchasing it on average 3.5 times.
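As a quick check, these headline figures can be reproduced directly from the buyer and transaction counts in the tables above:

```python
# Reproduce the headline penetration and PPB numbers from the tables above
panel_size = 5021

category_pen = 4574 / panel_size   # ~0.91: 91% of the panel bought the category
category_ppb = 20030 / 4574        # ~4.4 category purchases per category buyer
alpha_pen = 2624 / panel_size      # ~0.52: 52% of the panel bought Alpha
alpha_ppb = 9060 / 2624            # ~3.5 Alpha purchases per Alpha buyer
```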
Penetration and PPB are in fact summary measures of an important but frequently overlooked summary of buyer behaviour: the distribution of purchase frequency. We first explore how to create this summary of category purchasing and then consider how to create such a summary of brand purchasing.
4.4 Distribution of Purchase
4.4.1 Distribution of Category Purchase Frequency
Looking at the panellist-level transaction summary, the distribution of category purchasing is determined by counting how many households made one category purchase (panellists 3102011, 3102046, etc.), two category purchases (panellists 3102012, 3102021, etc.), and so on.
Code
# Category Transaction Distribution
category_trans_dist = (
    trans_summary('Category', grocery_lf, 1)
    .group_by(pl.col('# of Purchases'))
    .agg(pl.len().alias("Frequency"))
    .collect()
    # Add a zero-purchase row to the data series
    .pipe(lambda df: df.vstack(
        pl.DataFrame(
            {
                "# of Purchases": [0],
                "Frequency": [panel_size - df["Frequency"].sum()],
            }
        ).with_columns(
            pl.col("# of Purchases").cast(pl.UInt32),
            pl.col('Frequency').cast(pl.UInt32)
        )
    ))
    .sort(by='# of Purchases')
    # Add a percentage-of-total column
    .with_columns(
        (pl.col('Frequency') / pl.col('Frequency').sum()).alias("% of Total")
    )
)

(
    GT(category_trans_dist, rowname_col='# of Purchases')
    .tab_header(title='Category Purchase Distribution')
    .tab_stubhead(label="# of Purchases")
    .fmt_integer(columns='Frequency')
    .fmt_percent(columns='% of Total')
    .data_color(
        columns=['% of Total'],
        domain=[0, 0.2],
        palette=['white', 'rebeccapurple']
    )
)
Category Purchase Distribution

# of Purchases    Frequency    % of Total
0                 447          8.90%
1                 667          13.28%
2                 759          15.12%
3                 757          15.08%
4                 612          12.19%
5                 514          10.24%
6                 381          7.59%
7                 247          4.92%
8                 209          4.16%
9                 127          2.53%
10                85           1.69%
11                57           1.14%
12                50           1.00%
13                27           0.54%
14                31           0.62%
15                15           0.30%
16                16           0.32%
17                7            0.14%
18                4            0.08%
19                2            0.04%
20                4            0.08%
22                1            0.02%
25                1            0.02%
27                1            0.02%
If we wish to create a visual representation of this distribution, it is tempting to simply plot the data in the % of total column. However, the resulting plot would be misleading, as some purchase frequencies are missing from the data. In particular, we see that no one made 21 category purchases; ditto for 23, 24, and 26. One solution is to manually insert the missing purchase-frequency levels with zero frequencies and then plot the data. However, the observed (relative) frequencies in the right tail are so small that they do not show up in a plot. We can therefore create a right-censored distribution. Here we have chosen 15 as the (right) censoring point; we see that 51 households (or 1% of the panellists) made 15 or more category purchases in year 1.
There is nothing magical about our choice of 15. Generally, the choice of censoring point is a function of how many bins you wish to display and the height of the right-most bar with which you feel comfortable.
Note that while the right-censored distribution is useful when creating a summary figure or table, it is of limited value beyond that. For example, it is not possible to compute the mean purchase frequency (and therefore PPB) from this summary.
Code
# Create a custom bin list (with an overflow bin) to label categories
labels = [str(i) if i < 15 else '15+' for i in range(0, 16, 1)]

chart = freq_dist_plot(
    data=category_trans_dist,
    column="# of Purchases",
    bin_edges=np.arange(0, 15, 1),
    labels=labels,
    x_title="Number of Category Transactions",
    y_title="% of Households",
    chart_title="Category Transaction Distribution (%)",
    subtitle="Distribution of category purchasing in year 1",
    left_closed=False,
    compute_rel_freq=False
)
chart.show()
We see that 9% of the panellists made no category purchases, 13% of the panellists made one category purchase, . . . , and 1% of the panellists made at least 15 category purchases in year 1.
4.4.2 The Relationship Between the Mean and PPB
What is the average number of times the category was purchased by a household in year 1?
Recall that mean, which we denote by \(E(X)\), is given by:
\(E(X)=\sum_{x}\frac{x\,f_{x}}{n}\)
where \(f_{x}\) is the frequency with which \(x\) occurs in the dataset, \(n\) is the sample size, and the summation is over all possible values of \(x\).
Denoting the relative frequency with which \(x\) occurs (i.e., \(f_{x}/n\)) by \(P(X = x)\), the mean is given by:
\(E(X)=\sum x P(X = x)\)
More formally, we can say that \(P(X = x)\) is the empirical probability that a randomly chosen household made \(x\) purchases.
Code
# Mean of the relative purchase frequency
mean = category_trans_dist.select(
    (pl.col('# of Purchases') * pl.col('% of Total')).sum()
).item(0, 0)

print(f'{mean =:0.3f}')
mean = 3.989
We see that average number of category purchases is 4.0. Why is this different from the 4.4 purchases per buyer (PPB) computed above? The mean we have just computed includes those households that made zero purchases, whereas PPB is the average among those households that made at least one (in this case category) purchase.
We can derive the relationship between these two quantities (mean of the relative frequency and PPB) in the following manner:
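A sketch of the derivation, writing \(n\) for the panel size and \(f_x\) for the number of panellists making \(x\) purchases:

\(\text{PPB} = \dfrac{\sum_{x \ge 1} x\,f_{x}}{\sum_{x \ge 1} f_{x}} = \dfrac{n \sum_{x} x\,P(X = x)}{n\,P(X \ge 1)} = \dfrac{E(X)}{\text{Penetration}}\)

The numerator is unchanged by including the \(x = 0\) term (it contributes nothing), and \(P(X \ge 1)\), the proportion of panellists making at least one purchase, is by definition the penetration.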
In other words, PPB is the mean divided by penetration.
Code
# All category PPB computations here are equivalent
ppb_category = mean / penetration.filter(pl.col('brand') == 'Category').item(0, 2)
print(ppb_category)

print(ppb.filter(pl.col('brand') == 'Category').item(0, 2))

ppb_category = mean / (1 - category_trans_dist['% of Total'][0])
print(ppb_category)
Let us now create the distribution of purchase frequency for Alpha. With the one exception noted below, the logic follows that used to create the distribution of category purchasing.
In contrast to the groupby aggregation output associated with our summary of category purchasing, this aggregation output does contain a zero-purchase category. However, we must be careful in our interpretation of the associated frequency. We see that 1950 category buyers did not buy Alpha in year 1. However, in order to have a complete summary of brand purchasing, we should also account for those 447 households that made no category purchases that year. The number of panellists making zero purchases of Alpha is the total number of panellists (5021) minus the number of panellists that made 1, 2, 3, . . . purchases.
Note: What is Alpha’s penetration of category buyers? (1 − 1950/4574) = 57%. That is, penetration of category buyers = 1 − (proportion of category buyers who did not purchase the brand).
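In code, using the counts reported above, the two penetration bases compare as follows:

```python
# Alpha's penetration of category buyers vs. of the whole panel
category_buyers = 4574
alpha_buyers = category_buyers - 1950   # 1950 category buyers never bought Alpha

pen_of_category_buyers = alpha_buyers / category_buyers   # ~57%
pen_of_panel = alpha_buyers / 5021                        # ~52%
```

The choice of base (category buyers vs. all panellists) should always be stated when reporting penetration.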
Code
# Alpha Brand Transaction Distribution
alpha_trans_dist = (
    trans_summary('Alpha', grocery_lf, 1)
    .group_by(pl.col('# of Purchases'))
    .agg(pl.len().alias("Frequency"))
    .collect()
    .pipe(lambda df: df.vstack(
        pl.DataFrame(
            {
                "# of Purchases": [0],
                # Calculate zero purchases by subtracting the total sum of all
                # purchases from the full panel size. This ensures all panellists
                # who made zero purchases are accounted for.
                "Frequency": [panel_size - df["Frequency"].sum()],
            }
        ).with_columns(
            pl.col("# of Purchases").cast(pl.UInt32),
            pl.col('Frequency').cast(pl.UInt32)
        )
    ))
    .sort(by='# of Purchases')
    .with_columns(
        # Add a '% of Total' column
        (pl.col('Frequency') / pl.col('Frequency').sum()).alias("% of Total")
    )
)

(
    GT(alpha_trans_dist, rowname_col='# of Purchases')
    .tab_header(title='Alpha Purchase Distribution')
    .tab_stubhead(label="# of Purchases")
    .fmt_integer(columns='Frequency')
    .fmt_percent(columns='% of Total')
    .data_color(
        columns=['% of Total'],
        domain=[0, 0.5],
        palette=['white', 'rebeccapurple']
    )
)
Alpha Purchase Distribution

# of Purchases    Frequency    % of Total
0                 2,397        47.74%
1                 733          14.60%
2                 517          10.30%
3                 400          7.97%
4                 277          5.52%
5                 227          4.52%
6                 144          2.87%
7                 101          2.01%
8                 71           1.41%
9                 47           0.94%
10                28           0.56%
11                24           0.48%
12                16           0.32%
13                16           0.32%
14                12           0.24%
15                6            0.12%
16                3            0.06%
18                1            0.02%
20                1            0.02%
Code
# Create a custom bin list (with an overflow bin) to label categories
labels = [str(i) if i < 10 else '10+' for i in range(11)]

chart = freq_dist_plot(
    data=alpha_trans_dist,
    column="# of Purchases",
    bin_edges=np.arange(0, 10, 1),
    labels=labels,
    x_title="Number of Transactions",
    y_title="% of Households",
    chart_title="Alpha Transaction Distribution (%)",
    subtitle="Distribution of purchase frequency for Alpha in year 1",
    left_closed=False,
    compute_rel_freq=False
)
chart.show()
4.5 Distribution of Spend
4.5.1 Distribution of Category Spend
We now turn our attention to creating summaries of total spend. Our initial goal is to create a histogram of category spend (in dollars) across those panellists that made at least one purchase in the category in year 1. In the following plot, the raw total spend data have been binned in bins with a width of $5.
We see that 16% of category buyers spent up to $5 in the category during this one-year period; 23% spent between $5 and $10; . . . and 2% spent more than $50.
Before deciding on what bin width to use when creating the histogram, let us first compute some basic descriptive statistics.
While this provides a good visualisation of the shape of the distribution, it can be difficult for most “consumers” of the plot to extract some additional information that may be of interest. For example, it is not easy to answer the question “What percentage of category buyers spent $30 or less in year 1?”.
There appears to be quite a bit of variability in category spend. To get further insight into the distribution of total category spend across the panellists, we need to determine the total spend quantities associated with the various percentile levels.
Code
percentiles = np.arange(5, 96, 5)
# np.percentile returns the k-th percentile of values in a range
percentile_values = np.percentile(category_spend.select('spend').collect(), percentiles)
percentile_df = pl.DataFrame({
    'Percentile Level': percentiles,
    'Percentile': percentile_values
})

(
    GT(percentile_df, rowname_col='Percentile Level')
    .tab_header(title='Category Spend Percentile')
    .tab_stubhead(label="Percentile Level")
    .fmt_number()
    .data_color(
        columns=['Percentile'],
        domain=[0, 40],
        palette=['white', 'rebeccapurple']
    )
)
Category Spend Percentile

| Percentile Level | Percentile |
|---:|---:|
| 5 | 2.69 |
| 10 | 3.39 |
| 15 | 4.67 |
| 20 | 5.88 |
| 25 | 6.76 |
| 30 | 7.63 |
| 35 | 9.07 |
| 40 | 10.06 |
| 45 | 11.03 |
| 50 | 12.57 |
| 55 | 13.73 |
| 60 | 15.24 |
| 65 | 16.84 |
| 70 | 18.79 |
| 75 | 20.74 |
| 80 | 23.43 |
| 85 | 26.88 |
| 90 | 31.81 |
| 95 | 39.72 |
We see that 5% of the category buyers spent $2.69 or less in the category during the year, 10% spent $3.39 or less, and so on. The heaviest 5% of buyers each spent more than $39.72 in the category during the year.
Looking at this output, we conclude that a bin size of $5 is probably about right.
How many bins do we go with? This is an empirical question. Since 5% of the panel spent more than $39.72, we certainly want to go above $40 in order to get a sense of how they are spread between $39.72 and the maximum of $166.70. We will go with $50. If too many panellists have spent more than $50, we can always add more bins. If too few panellists fall into this bin, we can always combine the bins we have created.
Note: The distributions of many customer behaviours have a long right tail. Accommodating the full range of values can make it difficult to get a clear sense of what is happening on the left side of the distribution. It can therefore be helpful to bin the data (as with a histogram) but to right-censor the data, assigning all of the observations with a value of \(x\) or higher to an \(x+\) bin.
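As a minimal sketch of this right-censoring idea (the spend values below are made up for illustration, not the panel data):

```python
import numpy as np

# Hypothetical spend values with a long right tail (not the panel data)
spend = np.array([3.2, 7.5, 12.1, 4.8, 55.0, 9.9, 120.4, 18.3, 2.1, 61.7])

# Right-censor at $50: every value of $50 or more falls into a single "50+" bin
censored = np.minimum(spend, 50)

# Bin edges of width $5 up to the censoring point, plus an open-ended overflow bin
edges = np.arange(0, 55, 5)                                  # 0, 5, ..., 50
counts, _ = np.histogram(censored, bins=np.append(edges, np.inf))
labels = [f"{lo}-{lo + 5}" for lo in edges[:-1]] + ["50+"]   # one label per bin
```

The three values above $50 all land in the "50+" bin, so the left side of the distribution stays readable regardless of how extreme the tail is.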
We see that 750 panellists spent $5 or less in the category during year 1, 1073 spent between $5 and $10, . . . , and 109 spent more than $50.
We convert the raw counts into percentages and plot these percentages as a bar chart.
Code
# Create a custom bin list (with an overflow bin) to label categories
labels = [f"{lb} - {lb + 5 if lb != 50 else '∞'}" for lb in range(-5, 55, 5)]
chart = freq_dist_plot(
    data=category_spend,
    column="spend",
    bin_edges=np.arange(0, 51, 5),
    labels=labels,
    x_title="Category Spend ($)",
    y_title="% of Households",
    chart_title="Category Spend Distribution (%)",
    subtitle="Distribution of category spend across category buyers"
)
chart.show()
The general shape of this distribution (i.e., an interior mode, median less than the mean, right-skewed with a long right-tail) is what we typically observe when we look at spend data.
This, of course, depends on the bin width we choose when summarising the data. If we had chosen a bin width of $10, the left-most bar would be the highest bar and we would no longer observe an interior mode.
4.5.2 Distribution of Brand Spend
Let us now create the distribution of spend on Alpha. We will follow the same basic process as for the distribution of category spend with a few minor changes.
Note that there are a number of 0 or NaN values in the Alpha spend column. While these panellists purchased in the category during the year, they did not purchase any of Alpha’s SKUs. The first thing we need to do is remove these observations.
Having looked at some basic descriptive stats (as above), we will summarise the data using bins of width $2, right-censoring at $40.
Code
labels = [f"{lb} - {lb + 2 if lb != 40 else '∞'}" for lb in range(-2, 41, 2)]
chart = freq_dist_plot(
    data=alpha_spend,
    column="spend",
    bin_edges=np.arange(0, 41, 2),
    labels=labels,
    x_title="Brand Spend ($)",
    y_title="% of Households",
    chart_title="Alpha Spend Distribution (%)",
    subtitle="Distribution of spend on Alpha across brand buyers",
    label_angle=-45
)
chart.show()
The general observation made earlier about the shape of the category spend distribution also holds here. The curve is less smooth, but this is a function of the smaller bin sizes. There is, however, one obvious aberration: the large spike for $2–$4. What is going on here? The average price of most Alpha SKUs is between $2 and $4. (Review the average SKU prices of Alpha.) We recall from Alpha’s distribution of purchase frequency that 48% of the panel made zero purchases of Alpha and 15% made one transaction. This implies that 0.15/(1 − 0.48) = 29% of Alpha buyers made just one purchase of the brand. Assuming they purchased only one unit of one Alpha SKU on that purchase occasion, we would expect a large number of Alpha buyers to spend between $2 and $4; 24% is not unrealistic.
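We can check this arithmetic directly using the exact shares from Alpha's purchase-frequency table (the 29% quoted above comes from the rounded inputs 0.15 and 0.48):

```python
# Shares from the Alpha purchase-frequency distribution
zero_share = 0.4774   # panellists with zero Alpha purchases
one_share = 0.1460    # panellists with exactly one Alpha purchase

# Among Alpha *buyers* (the non-zero group), the share who bought just once
one_time_among_buyers = one_share / (1 - zero_share)
print(f"{one_time_among_buyers:.0%}")  # prints 28% with the exact table shares
```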
How would you determine the validity of the assumption that the buyers who purchased Alpha just once mostly bought only one unit of Alpha SKU on that purchase occasion?
4.5.3 Analyzing Alpha’s One-Time Buyers in Year 1
Code
# One-time Alpha buyers
one_time_alpha_buyers = (
    trans_summary('Alpha', grocery_lf, 1)
    .filter(pl.col('# of Purchases') == 1)
    .collect()
)
one_time_alpha_buyers

# Alpha buyers who spent between $2 and $4
spend_range_alpha_buyers = (
    spend_summary('Alpha', grocery_lf, 1)
    .filter((pl.col('spend') >= 2) & (pl.col('spend') < 4))
    .collect()
)

res = (
    one_time_alpha_buyers
    .with_columns(contains=pl.col('panel_id').is_in(spend_range_alpha_buyers.select('panel_id')))
    .select('contains')
    .sum()
    .item()
)

n_one_time = one_time_alpha_buyers.height
n_in_range = spend_range_alpha_buyers.height
print(f"There were {n_one_time} one-time buyers of Alpha brand in Year 1.")
print(f"{res} of those one-time Alpha buyers spent between [$2, $4).")
print(f"Given that there are {n_in_range} Alpha buyers in total who spent between $2-$4,")
print(f"the {res} one-time buyers represent {res / n_in_range:.0%} of the Alpha buyers in that spend range.")
There were 733 one-time buyers of Alpha brand in Year 1.
623 of those one-time Alpha buyers spent between [$2, $4).
Given that there are 633 Alpha buyers in total who spent between $2-$4,
the 623 one-time buyers represent 98% of the Alpha buyers in that spend range.
labels = [f"{lb} - {lb + 1 if lb != 12 else '∞'}" for lb in range(-1, 13, 1)]
chart = freq_dist_plot(
    data=category_vol,
    column="volume",
    bin_edges=np.arange(0, 13, 1),
    labels=labels,
    x_title="Category Volume Purchased",
    y_title="% of Households",
    chart_title="Category Volume Purchase Distribution (%)",
    subtitle="Distribution of category volume purchases across category buyers"
)
chart.show()
4.6.2 Distribution of Brand Volume Purchase Frequency
Note that there are a number of 0 or NaN values in the Alpha volume purchase column. These panellists (as in the brand spend section) purchased in the category during the year, but they did not purchase any of Alpha’s products. Let us remove the observations for panel_ids that did not make at least one purchase of the Alpha brand during the year.
exceptions = ['grocery_lf', 'sku_lf', 'kiwi_lf', 'In', 'exceptions', 'active_variables',
              'penetration', 'ppb', 'panel_size']
active_variables = [
    var for var, value in globals().items()
    if not var.startswith('_')      # Exclude variables that start with "_"
    and var not in exceptions       # Exclude variables in the exceptions list
    and isinstance(value, (pl.LazyFrame, pl.DataFrame, pl.Series, alt.Chart, alt.LayerChart,
                           list, int, float, str, np.ndarray, np.int64, np.float32))  # Remove these types only
]
for var in active_variables:
    del globals()[var]
del active_variables, exceptions, var
gc.collect()
127
4.7 Performing a Basic Decile Analysis
We have observed that there is a lot of variability in purchase frequency and spend, be it at the product or category level. A useful way of expressing the variability is via a decile analysis. As the name suggests, this sees us dividing the customer base into 10 equally sized groups and then summarising each group’s buying behaviour. We will consider two versions of this analysis.
4.7.1 Decile analysis of category buying behaviour - Decile based on category buyers
We will first focus on creating the table below. The table captures the decile analysis of category buying behaviour (where each decile equals 10% category buyers). Recall that 4574 households made at least one category purchase in year 1. Having sorted these households by total category spend, we create 10 equally sized groups. Decile 1 is the 10% of households that spent the most in the category during year 1, decile 2 is the next largest 10% of spenders, and so on.
Code
# Create decile cuts in the group_spend data using the sorted and ranked data
# Part 1: Break-tie ranking of total category spend data
# Part 2: Decile cuts in ranking
buyer_decile = (
    spend_summary('Category', grocery_lf, 1)
    .with_columns(
        pl.col('spend').rank(method='ordinal', descending=True).alias('rank'),
    )
    .with_columns(
        (np.floor(10 * (pl.col('rank') - 1) / pl.col('rank').max()) + 1)
        .cast(pl.UInt16)
        .alias('decile')
    )
    .sort('decile', 'rank')
    .collect()
)
buyer_decile
Reading across the row associated with Decile 1, we see that they accounted for 28% of category spend and 24% of total transactions. On average, they spent $44.65 in the category across an average of 10.6 purchase occasions. This corresponds to an average category spend per category transaction of $4.21. On average, these households purchased 1.8 different brands during the year. Contrast this to the 10% of category buyers that spent the least in the category. They account for 2% of category spend and transactions, making on average one category purchase. And so on.
This is a very basic decile table. Some additional information that could be reported includes the average number of units purchased per transaction and the average number of unique SKUs purchased in the year.
4.7.2 Decile analysis of category buying behaviour - Decile based on category spend/revenue
The decile analysis we have just completed uses deciles that represent 10% of the category buyers. An alternative approach is to create deciles that represent 10% of category spend. In other words, the previous analysis took the entire customer base that purchased the category in year 1 and split that base into ten equal groups. In this analysis, we take the total customer spend (total revenue) and split that revenue into ten equal groups. The only change to what we have done above is how we create the decile variable.
Code
# Divide the total revenue into 10 equal parts and generate the decile bins
spend_total = spend_summary('Category', grocery_lf, 1).select(pl.col('spend')).sum().collect().to_series()[0]
breaks = np.arange(0, spend_total, spend_total / 10)
labels = [f'{i}' for i in range(0, 11, 1)]

spend_decile = (
    spend_summary('Category', grocery_lf, 1)
    .sort('spend', descending=True)
    .with_columns(
        # Create a cumulative sum column for category spend (category revenue)
        pl.col('spend').cum_sum().alias('cum sum')
    )
    .with_columns(
        pl.col('cum sum').cut(  # cut determines the decile range for each cumulative sum value
            breaks=breaks,
            labels=labels       # cut also adds the decile label from 1 to 10
        ).cast(pl.UInt16).alias('decile')
    )
    .sort('decile')
    .collect()
)
spend_decile
Reading across the row associated with Decile 1, we see that the top 2% of households accounted for 10% of category spend. On average, they spent $63.77 in the category, across an average of 13.4 purchase occasions. This corresponds to an average category spend per category transaction of $4.74. On average, these households purchased 1.9 different brands during the year.
A Lorenz curve is a common graphical tool for visualising “concentration” or “inequality” in the distribution of a quantity of interest (e.g., income, buying behaviour). It shows the proportion of the overall quantity (e.g., number of transactions, spend) accounted for by the bottom x% of the units of observation (e.g., households). When analysing buyer behaviour, this sees us lining up all customers in ascending order of their level of purchasing and computing the share of total purchasing accounted for by each person. The Lorenz curve is created by plotting the cumulative percentage of customers (x-axis) against the cumulative percentage of total purchasing (y-axis).
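As a minimal sketch with made-up purchase counts (not the panel data), the Lorenz coordinates are simply two cumulative shares computed after sorting:

```python
import numpy as np

# Hypothetical purchases per customer (not the panel data)
purchases = np.array([1, 1, 2, 3, 5, 8])

sorted_p = np.sort(purchases)                                 # ascending: lightest buyers first
cum_cust = np.arange(1, len(sorted_p) + 1) / len(sorted_p)    # cumulative % of customers
cum_purch = np.cumsum(sorted_p) / sorted_p.sum()              # cumulative % of purchasing

# Each (cum_cust[i], cum_purch[i]) pair is a point on the Lorenz curve
for x, y in zip(cum_cust, cum_purch):
    print(f"{x:.0%} of customers account for {y:.0%} of purchases")
```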
The Lorenz curve for transactions associated with Alpha is computed and plotted below and interpreted in the following manner. We see that x = 80% roughly corresponds to y = 54%, which means that 80% of the buyers of Alpha (when sorted from least to most frequent) account for 54% of all the buying of Alpha in year 1. This implies the top 20% of buyers (in terms of purchase frequency) account for 46% of total purchases. The “rule” of 80/20 does not hold here; rather, it is 46/20.
Code
# We are only interested in those panellists that made at least one purchase of Alpha in year 1.
# Alpha brand transaction distribution for transaction counts greater than 0
alpha_trans_dist = (
    trans_summary('Alpha', grocery_lf, 1)
    .filter(pl.col('# of Purchases') > 0)
    .group_by('# of Purchases')
    # Frequency of the grouped transaction count
    .agg(pl.col('# of Purchases').count().alias('Frequency'))
    .sort('# of Purchases')
    # Add a '% of Buyers' column
    .with_columns((pl.col('Frequency') / pl.col('Frequency').sum()).alias('% of Buyers'))
    # Cumulative % of buyers
    .with_columns((pl.col('% of Buyers').cum_sum()).alias('Cum % of Buyers'))
    # Total number of transactions made by the customers at each purchase level:
    # Total Transactions = frequency x # of transactions
    .with_columns((pl.col('Frequency') * pl.col('# of Purchases')).alias('Total Transactions'))
    # Add a '% of Transactions' column
    .with_columns((pl.col('Total Transactions') / pl.col('Total Transactions').sum()).alias('% Transactions'))
    # Cumulative % of transactions
    .with_columns((pl.col('% Transactions').cum_sum()).alias('Cum % Transactions'))
)

(
    GT(alpha_trans_dist.collect(), rowname_col='# of Purchases')
    .tab_header(title='Alpha Buyers & Transaction Distribution', subtitle='Lorenz Curve Table')
    .tab_stubhead(label="# of Purchases")
    .fmt_percent(columns=['% of Buyers', 'Cum % of Buyers', '% Transactions', 'Cum % Transactions'], decimals=1)
    .fmt_integer(columns=['Frequency', 'Total Transactions'])
)
Alpha Buyers & Transaction Distribution
Lorenz Curve Table

| # of Purchases | Frequency | % of Buyers | Cum % of Buyers | Total Transactions | % Transactions | Cum % Transactions |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 733 | 27.9% | 27.9% | 733 | 8.1% | 8.1% |
| 2 | 517 | 19.7% | 47.6% | 1,034 | 11.4% | 19.5% |
| 3 | 400 | 15.2% | 62.9% | 1,200 | 13.2% | 32.7% |
| 4 | 277 | 10.6% | 73.4% | 1,108 | 12.2% | 45.0% |
| 5 | 227 | 8.7% | 82.1% | 1,135 | 12.5% | 57.5% |
| 6 | 144 | 5.5% | 87.6% | 864 | 9.5% | 67.0% |
| 7 | 101 | 3.8% | 91.4% | 707 | 7.8% | 74.8% |
| 8 | 71 | 2.7% | 94.1% | 568 | 6.3% | 81.1% |
| 9 | 47 | 1.8% | 95.9% | 423 | 4.7% | 85.8% |
| 10 | 28 | 1.1% | 97.0% | 280 | 3.1% | 88.9% |
| 11 | 24 | 0.9% | 97.9% | 264 | 2.9% | 91.8% |
| 12 | 16 | 0.6% | 98.5% | 192 | 2.1% | 93.9% |
| 13 | 16 | 0.6% | 99.1% | 208 | 2.3% | 96.2% |
| 14 | 12 | 0.5% | 99.6% | 168 | 1.9% | 98.1% |
| 15 | 6 | 0.2% | 99.8% | 90 | 1.0% | 99.1% |
| 16 | 3 | 0.1% | 99.9% | 48 | 0.5% | 99.6% |
| 18 | 1 | 0.0% | 100.0% | 18 | 0.2% | 99.8% |
| 20 | 1 | 0.0% | 100.0% | 20 | 0.2% | 100.0% |
We are only interested in those panellists that made at least one purchase of Alpha in year 1.
We compute the total number of transactions made by those customers who made one purchase of Alpha: this is obviously the number of transactions times the frequency of the transactions. We see that the 517 households that purchased Alpha twice made a total of 1034 transactions.
What percentage of the total number of buyers of Alpha are the 733 households that made one purchase? The 517 households that made two purchases? What percentage of total purchasing are the 733 purchases made by those households that made one purchase? The 1034 purchases made by those households that made two purchases? We first compute the total number of Alpha buyers and the total amount of purchasing by this group, and then compute the associated percentages. We see that 28% of Alpha buyers made just one purchase and their purchasing accounted for 8% of all Alpha purchasing. Similarly, we see that 20% of buyers made two purchases and their purchasing accounted for 11% of all Alpha purchasing.
We compute the cumulative percentages of both the % of buyers and % of transactions quantities. We see, for example, that 48% of all buyers made two or fewer purchases and accounted for 20% of all the purchases of Alpha.
We see that 73.4% (\(x_{0}\)) of buyers accounted for 45.0% (\(y_{0}\)) of purchases, and that 82.1% (\(x_{1}\)) of buyers accounted for 57.5% (\(y_{1}\)) of purchases. Therefore, when \(x\) = 80%, \(y\) can be solved in the following way:
\[ y = y_{0} + \frac{x - x_{0}}{x_{1} - x_{0}}\,(y_{1} - y_{0}) = 45.0\% + \frac{80\% - 73.4\%}{82.1\% - 73.4\%}\,(57.5\% - 45.0\%) \approx 54.5\% \]

That is, 80% falls 76% of the way between 73.4% and 82.1%, so the associated percentage of total purchases lies 76% of the way between 45.0% and 57.5%.
As the “bottom” 80% account for 54% of total purchasing, it follows that the “top” 20% account for 46% of total purchasing.
Code
# NumPy's linear interpolation function simplifies the computation, returning the interpolated y value
np.interp(0.8, xp=alpha_lorenz_plot['Cum % of Buyers'], fp=alpha_lorenz_plot['Cum % Transactions'])
np.float64(0.5448123620309051)
A related quantity of interest is the percentage of buyers that account for half of total purchasing. This can be read off the Lorenz curve in the following manner.
We see that 73.4% of purchasers account for 45.0% of total purchases, and that 82.1% of purchasers account for 57.5% of total purchases.
We can apply the same linear interpolation function but instead of solving for the \(y\), we solve for \(x\).
We see that 50% lies 40% of the way between 45.0% and 57.5%. Therefore, the associated percentage of purchasers lies 40% of the way between 73.4% and 82.1%, which we see is 76.9%.
As the “bottom” 77% of buyers account for half of total purchasing, it follows that the “top” 23% also account for half of total purchasing, which we can write as 50/23. This quantity is easy for most people to grasp.
Code
# Reversing the axes returns the interpolated x value
np.interp(0.5, xp=alpha_lorenz_plot['Cum % Transactions'], fp=alpha_lorenz_plot['Cum % of Buyers'])
np.float64(0.7690548780487805)
4.8.2 Lorenz Curves - Spend - Continuous
Purchase frequency is a discrete quantity and we created the Lorenz curve off the distribution of transactions. We now consider how to create a Lorenz curve when the quantity of interest is continuous. We will focus on creating the Lorenz curve for the spend associated with Alpha.
Code
alpha_spend_dist = (
    spend_summary('Alpha', grocery_lf, 1)
    .filter(pl.col('spend') > 0)
    # Sort Alpha customers from lowest to highest spend
    .sort(pl.col('spend'), descending=False)
    # Compute spend as a % of total Alpha brand spend in year 1
    # (the percentage of total Alpha purchasing accounted for by each panellist)
    .with_columns((pl.col('spend') / pl.col('spend').sum()).alias('% Spend'))
    # Compute the cumulative percentage of spend
    .with_columns((pl.col('% Spend').cum_sum()).cast(pl.Float64).alias('Cum % Spend'))
).collect()

# Compute the cumulative percentage of buyers
spend_count = alpha_spend_dist.height
alpha_spend_dist = alpha_spend_dist.with_columns(
    pl.lit(np.linspace(1 / spend_count, 1, spend_count)).alias('Cum % Cust')
)
alpha_spend_dist
Looking at alpha_spend_dist, we see that the “bottom” 80% of Alpha buyers account for 50.8% of its revenue in year 1, which is equivalent to 49/20. This is higher than the 46/20 we observed for transactions.
exceptions = ['grocery_lf', 'sku_lf', 'kiwi_lf', 'In', 'exceptions', 'active_variables',
              'penetration', 'ppb', 'panel_size']
active_variables = [
    var for var, value in globals().items()
    if not var.startswith('_')      # Exclude variables that start with "_"
    and var not in exceptions       # Exclude variables in the exceptions list
    and isinstance(value, (pl.LazyFrame, pl.DataFrame, pl.Series, alt.Chart, alt.LayerChart,
                           list, int, float, str, np.ndarray, np.int64, np.float32))  # Remove these types only
]
for var in active_variables:
    del globals()[var]
del active_variables, exceptions, var
gc.collect()
74
5 Exploring Multibrand Buying Behaviour
5.1 Introduction
So far, we have explored purchasing at the level of the individual brand or overall category. We now step back and consider panellists’ purchasing of multiple brands (in a given time period).
To set the scene, let us first determine the percentage of category buyers that bought 1, 2, 3, 4, or 5 different brands in the course of year 1. (Note that we only have five brands in the dataset, with Other being an aggregation of several very small share brands).
Code
brand_purchases = (
    trans_summary('Category', grocery_lf, 1)
    .group_by('Brands Purchased')
    .agg(pl.col('panel_id').count().alias('Category Buyer'))
    .with_columns((pl.col('Category Buyer') / pl.col('Category Buyer').sum()).alias('% Category Buyers'))
    .sort('Brands Purchased')
).collect()

(
    GT(brand_purchases, rowname_col='Brands Purchased')
    .tab_header(title='Category Buyers That Bought Different Brands',
                subtitle='Distribution of the number of separate brands purchased by category buyers in year 1')
    .tab_stubhead(label="Brands Purchased")
    .fmt_integer(columns=['Category Buyer'])
    .fmt_percent(columns=['% Category Buyers'], decimals=0)
    .data_color(
        columns=['% Category Buyers'],
        domain=[0, 0.7],
        palette=['white', 'rebeccapurple']
    )
)
Category Buyers That Bought Different Brands
Distribution of the number of separate brands purchased by category buyers in year 1

| Brands Purchased | Category Buyer | % Category Buyers |
|---:|---:|---:|
| 1 | 2,964 | 65% |
| 2 | 1,278 | 28% |
| 3 | 293 | 6% |
| 4 | 39 | 1% |
We note that almost two-thirds of the category buyers only ever bought one brand. This is despite the fact that 86% of category buyers made two or more category purchases in the course of the year. No household purchased all five brands that year.
Note: We know from analyzing the distribution of category purchase frequency that 8.9% of the panel made zero category purchases and 13.2% made one category purchase. 1 − 0.132/(1 − 0.089) = 0.855, which means 86% of year 1 category buyers made more than one category purchase that year.
This can also be computed as (1 − 0.132 − 0.089)/(1 − 0.089), i.e., [% of panellists that made more than one category purchase] / [% of panellists that are category buyers].
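A quick check that the two computations agree:

```python
zero_share = 0.089   # panel share with zero category purchases
one_share = 0.132    # panel share with exactly one category purchase

# Version 1: complement of one-time buyers among category buyers
v1 = 1 - one_share / (1 - zero_share)
# Version 2: multi-purchase panellists as a share of category buyers
v2 = (1 - one_share - zero_share) / (1 - zero_share)

assert abs(v1 - v2) < 1e-12   # algebraically identical
print(f"{v1:.1%}")            # prints 85.5%
```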
How does the number of different brands purchased in the year vary as a function of the number of category purchases made during the year?
Code
# Number of category purchases made by each household vs. number of brands purchased by each household
brand_loyalty = (
    trans_summary('Category', grocery_lf, 1)
    .group_by('# of Purchases', 'Brands Purchased')
    .agg(pl.col('panel_id').count().alias('Count'))
    .collect()
    .pivot(
        on='Brands Purchased',
        index='# of Purchases',
        values='Count'
    )
    .sort(pl.col('# of Purchases'))
    .select('# of Purchases', '1', '2', '3', '4')
)

(
    GT(brand_loyalty, rowname_col='# of Purchases')
    .tab_header(title='# of Purchase Occasions Vs. Brands Purchased',
                subtitle='Brands purchased as function of the number of category purchases made in year 1')
    .tab_stubhead(label="# of Purchases")
    .sub_missing(missing_text="")
    .fmt_integer()
    .data_color(
        columns=['1', '2', '3', '4'],
        domain=[0, 700],
        palette=['white', 'rebeccapurple'],
        na_color='white'
    )
)
# of Purchase Occasions Vs. Brands Purchased
Brands purchased as function of the number of category purchases made in year 1

| # of Purchases | 1 | 2 | 3 | 4 |
|---:|---:|---:|---:|---:|
| 1 | 655 | 12 | | |
| 2 | 573 | 184 | 2 | |
| 3 | 516 | 216 | 24 | 1 |
| 4 | 370 | 193 | 47 | 2 |
| 5 | 278 | 189 | 43 | 4 |
| 6 | 196 | 124 | 53 | 8 |
| 7 | 121 | 93 | 28 | 5 |
| 8 | 87 | 94 | 25 | 3 |
| 9 | 51 | 51 | 22 | 3 |
| 10 | 42 | 34 | 9 | |
| 11 | 24 | 24 | 6 | 3 |
| 12 | 16 | 21 | 9 | 4 |
| 13 | 11 | 9 | 6 | 1 |
| 14 | 13 | 9 | 8 | 1 |
| 15 | 4 | 8 | 3 | |
| 16 | 3 | 6 | 4 | 3 |
| 17 | 1 | 3 | 3 | |
| 18 | 1 | 3 | | |
| 19 | | 2 | | |
| 20 | 2 | 1 | 1 | |
| 22 | | 1 | | |
| 25 | | 1 | | |
| 27 | | | | 1 |
Looking at the resulting table output, we see that there is quite a high level of sole-brand loyalty (i.e., only buying one brand), even as the number of category purchases increases. We have two panellists that made 20 category purchases, all with the same brand. That’s some level of loyalty!
We can compute the (weighted) average number of different brands purchased for each level of category purchasing.
Code
wa_brands_purchased = (
    trans_summary('Category', grocery_lf, 1)
    .group_by('# of Purchases', 'Brands Purchased')
    .agg(pl.col('panel_id').count().alias('Count'))
    .with_columns((pl.col('Brands Purchased') * pl.col('Count')).alias('Product'))
    .group_by('# of Purchases')
    .agg(
        (pl.col('Product').sum() / pl.col('Count').sum()).alias('Weighted Average')
    )
    .sort(by='# of Purchases')
).collect()

(
    GT(wa_brands_purchased, rowname_col='# of Purchases')
    .tab_header(title='W.A. # of Different Brands Purchased')
    .tab_stubhead(label="# of Purchases")
    .fmt_number()
    .data_color(
        columns=['Weighted Average'],
        domain=[1, 4],
        palette=['white', 'rebeccapurple']
    )
    .cols_label({'Weighted Average': 'Weighted Average Brands'})
)
W.A. # of Different Brands Purchased

| # of Purchases | Weighted Average Brands |
|---:|---:|
| 1 | 1.02 |
| 2 | 1.25 |
| 3 | 1.35 |
| 4 | 1.48 |
| 5 | 1.56 |
| 6 | 1.67 |
| 7 | 1.66 |
| 8 | 1.73 |
| 9 | 1.82 |
| 10 | 1.61 |
| 11 | 1.79 |
| 12 | 2.02 |
| 13 | 1.89 |
| 14 | 1.90 |
| 15 | 1.93 |
| 16 | 2.44 |
| 17 | 2.29 |
| 18 | 1.75 |
| 19 | 2.00 |
| 20 | 1.75 |
| 22 | 2.00 |
| 25 | 2.00 |
| 27 | 4.00 |
We note that the average number of brands purchased does increase as a function of category purchasing. This should not come as a surprise, as more category purchases mean more opportunities to buy different brands. We saw earlier that two brands (Alpha and Bravo) had a combined value market share of 86%. As such, the relatively low number of different brands purchased in the year is not too surprising.
We now consider three common analyses designed to give insight into the nature of multibrand buying behaviour.
5.2 Duplication of Purchase
We know from our analysis of penetration and PPB earlier that 52% of households purchased Alpha and 51% of households purchased Bravo. What percentage of Alpha buyers also purchased Bravo during the year (and vice-versa)? The answer to these (and similar) questions is provided by a duplication of purchase analysis.
Before performing this analysis on our dataset, let us consider a toy example. The following table summarizes the purchasing of four brands by six households.
| | A | B | C | D |
|---|---:|---:|---:|---:|
| HH01 | 1 | 0 | 2 | 0 |
| HH02 | 0 | 1 | 0 | 0 |
| HH03 | 1 | 3 | 0 | 0 |
| HH04 | 0 | 0 | 1 | 4 |
| HH05 | 1 | 1 | 0 | 1 |
| HH06 | 0 | 0 | 0 | 1 |
We see that three households made at least one purchase of brand A, three households made at least one purchase of brand B, and so on. How many brand A buyers also purchased brand B? Two (HH03 and HH05). How many brand A buyers also purchased brand C? One (HH01). Repeating this for all brands gives us the following table, which we will call a duplication count table.
| | A | B | C | D |
|---|---:|---:|---:|---:|
| A | 3 | 2 | 1 | 1 |
| B | 2 | 3 | 0 | 1 |
| C | 1 | 0 | 2 | 1 |
| D | 1 | 1 | 1 | 3 |
For any given row, the number in each cell is the number of buyers of the brand associated with that row that also purchased the brand associated with that column. (The diagonal is obviously the number of buyers of each brand.) Looking at the row for brand A, we see that three households purchased that brand. Two of these households (67%) also made at least one purchase of brand B, and one of these three households (33%) also made at least one purchase of brand C. These row percentages are reported in the following table, which we call the duplication of purchase table. (By convention, we leave the diagonal blank.)
| | A | B | C | D |
|---|---:|---:|---:|---:|
| A | | 67% | 33% | 33% |
| B | 67% | | 0% | 33% |
| C | 50% | 0% | | 50% |
| D | 33% | 33% | 33% | |
How can we create this table efficiently when we have a large number of panellists? One approach makes use of matrix multiplication. We first create what we will call an “ever buyers” matrix (or incidence matrix): a matrix of size (number of panellists) × (number of brands), where each cell takes the value 1 if the panellist (row) purchased the brand (column) at least once in the period of interest, and 0 otherwise.
| | A | B | C | D |
|---|---:|---:|---:|---:|
| HH01 | 1 | 0 | 1 | 0 |
| HH02 | 0 | 1 | 0 | 0 |
| HH03 | 1 | 1 | 0 | 0 |
| HH04 | 0 | 0 | 1 | 1 |
| HH05 | 1 | 1 | 0 | 1 |
| HH06 | 0 | 0 | 0 | 1 |
Pre-multiplying the “ever buyers” matrix by its own transpose gives us the duplication count table which we created above: \(A^{T}\cdot A=B\). (Evaluate the example below.)
Transposed “ever buyer” matrix:

| | HH01 | HH02 | HH03 | HH04 | HH05 | HH06 |
|---|---:|---:|---:|---:|---:|---:|
| A | 1 | 0 | 1 | 0 | 1 | 0 |
| B | 0 | 1 | 1 | 0 | 1 | 0 |
| C | 1 | 0 | 0 | 1 | 0 | 0 |
| D | 0 | 0 | 0 | 1 | 1 | 1 |
Matrix multiplied by:

| | A | B | C | D |
|---|---:|---:|---:|---:|
| HH01 | 1 | 0 | 1 | 0 |
| HH02 | 0 | 1 | 0 | 0 |
| HH03 | 1 | 1 | 0 | 0 |
| HH04 | 0 | 0 | 1 | 1 |
| HH05 | 1 | 1 | 0 | 1 |
| HH06 | 0 | 0 | 0 | 1 |
The “ever buyer” (incidence) matrix encodes each panellist’s brand buying as a binary value: 1 (yes, bought at least once) or 0 (no). When we multiply the transpose of this binary matrix by the matrix itself, each cell of the result counts the panellists who bought both the row brand and the column brand, i.e., the intersection of the two brands’ buyer bases.
Dividing each cell by the number of buyers of the brand associated with each row gives us the duplication of purchase table.
# m x n matrix where m = # of panellists, n = # of brands, and each value is the number of purchases
brand_purch = np.array([
    [1, 0, 2, 0],
    [0, 1, 0, 0],
    [1, 3, 0, 0],
    [0, 0, 1, 4],
    [1, 1, 0, 1],
    [0, 0, 0, 1]
])

# m x n matrix where the value is 1 if the buyer ever purchased the brand, 0 if never
ever_buyers = np.where(brand_purch != 0, 1, 0)

# n x n square matrix: count of row-brand buyers who also purchased the column brand
dup_count = ever_buyers.T @ ever_buyers
print('Duplication Count Matrix:')
print(dup_count, dup_count.shape)
print()

# Diagonal of dup_count: the number of buyers of each ROW brand
brand_buyers = np.diag(dup_count).reshape(4, 1)
print('Number of Buyers of Each Brand:')
print(brand_buyers, brand_buyers.shape)

# n x n square matrix: proportion of each ROW brand's buyers that also purchased the COLUMN brand
dup_purchase = dup_count / brand_buyers
print('Duplication of Purchase Matrix:')
print(dup_purchase, dup_purchase.shape)
(
    GT(dup_count_df, rowname_col='Brands')
    .tab_header(title='Duplication Count Matrix',
                subtitle='Count of buyers who purchased row AND column brands')
    .tab_stubhead(label="Brands")
    .sub_missing(missing_text="")
    .fmt_integer()
    .data_color(
        columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
        domain=[0, 900],
        palette=['white', 'rebeccapurple'],
    )
    .tab_style(
        style=[
            style.text(color="white"),
            style.fill('lightgrey'),
        ],
        locations=[
            loc.body(['Alpha'], ['Alpha']),
            loc.body(['Bravo'], ['Bravo']),
            loc.body(['Charlie'], ['Charlie']),
            loc.body(['Delta'], ['Delta']),
            loc.body(['Other'], ['Other'])
        ]
    )
    .tab_source_note(md("""**Read Row/Column as**: _# of People who Purchased [Brand] / Who also Purchased [Brand]_"""))
)
Duplication Count Matrix
Count of buyers who purchased row AND column brands

| Brands | Alpha | Bravo | Charlie | Delta | Other |
|---|---:|---:|---:|---:|---:|
| Alpha | 2,624 | 896 | 403 | 239 | 69 |
| Bravo | 896 | 2,562 | 382 | 140 | 110 |
| Charlie | 403 | 382 | 813 | 116 | 26 |
| Delta | 239 | 140 | 116 | 380 | 10 |
| Other | 69 | 110 | 26 | 10 | 176 |

Read Row/Column as: # of People who Purchased [Brand] / Who also Purchased [Brand]
Code
(
    GT(dup_purchase_df, rowname_col='Brands')
    .tab_header(title='Duplication of Purchase Matrix',
                subtitle='% of brand buyers that also purchased other brands')
    .tab_stubhead(label="Brands")
    .sub_missing(missing_text="")
    .fmt_percent(decimals=0)
    .data_color(
        columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
        domain=[0, 0.7],
        palette=['white', 'rebeccapurple'],
    )
    .tab_style(
        style=[
            style.text(color="white"),
            style.fill('white'),
        ],
        locations=[
            loc.body(0, 0),
            loc.body(1, 1),
            loc.body(2, 2),
            loc.body(3, 3),
            loc.body(4, 4)
        ]
    )
    .tab_source_note(md("""**Read Row/Column as**: _% of People who Purchased [Brand] / Who also Purchased [Brand]_"""))
)
Duplication of Purchase Matrix
% of brand buyers that also purchased other brands

| Brands | Alpha | Bravo | Charlie | Delta | Other |
|---|---:|---:|---:|---:|---:|
| Alpha | 100% | 34% | 15% | 9% | 3% |
| Bravo | 35% | 100% | 15% | 5% | 4% |
| Charlie | 50% | 47% | 100% | 14% | 3% |
| Delta | 63% | 37% | 31% | 100% | 3% |
| Other | 39% | 62% | 15% | 6% | 100% |

Read Row/Column as: % of People who Purchased [Brand] / Who also Purchased [Brand]
The duplication of purchase matrix for year 1 is computed above. We see that 34% of those panellists that purchased Alpha in year 1 also made at least one purchase of Bravo that year. We see that 15% of Alpha buyers also bought Charlie, whereas 50% of Charlie buyers also bought Alpha. This asymmetry is not surprising given the relative size of the two brands.
5.3 Share of Category Requirements (SCR)
The duplication of purchase table tells us that 34% of Alpha buyers also purchased Bravo, 15% also purchased Charlie, and so on. This “polygamous purchasing” leads to an obvious question: How “loyal” are Alpha’s buyers to Alpha? The answer obviously depends on what we mean by “loyal”. One commonly used measure of loyalty is share of category requirements (SCR), which is the percentage of category volume that the brand represents among its buyers.
The SCR numbers for each brand are reported below. We see that Alpha has a share of category requirements of 69%. This means that 69% of the total category volume purchased by buyers of Alpha goes to Alpha. Contrast this to Delta, which has an SCR of 40%. This means that 40% of the volume purchased in the category by buyers of Delta goes to that brand.
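As a minimal sketch of the calculation, here is Alpha's year 1 result computed from the two totals reported in the SCR table below:

```python
# SCR for one brand: of all category volume bought by the brand's buyers,
# the fraction that went to the brand itself. Figures are Alpha's year 1
# totals from the SCR table.
brand_volume = 9166      # volume of Alpha bought by panellists who bought Alpha
category_volume = 13315  # all category volume bought by those same panellists

scr = brand_volume / category_volume
print(f"SCR = {scr:.0%}")  # -> SCR = 69%
```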
Code
# List of brands
brands = ['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other']

# Initialize results list
scr_results = []

with pl.StringCache():
    # Compute total category volume
    total_vol = vol_summary('Category', grocery_lf, 1)

    for brand in brands:
        # Join brand-specific volume to total category volume to get the total
        # category volume for panellists who purchased the brand
        scr_lf = (
            total_vol.join(
                other=vol_summary(brand, grocery_lf, 1),
                on='panel_id',
                how='inner'
            )
            .with_columns(
                # Total purchasing of each brand
                (pl.col('volume_right').sum()).alias('Brand Volume'),
                # Total category purchasing, conditional on at least one
                # purchase of the brand of interest having been made
                (pl.col('volume').sum()).alias('Category Volume'),
            )
            .with_columns(
                # Compute SCR - ratio of brand purchasing to category purchasing
                (pl.col('Brand Volume').sum() / pl.col('Category Volume').sum()).alias('SCR')
            )
            .select('Brand Volume', 'Category Volume', 'SCR')
            .collect()
        )

        # Append results
        scr_results.append({
            "Brands": brand,
            "Brand Purchasing": scr_lf.item(1, "Brand Volume"),
            "Category Purchasing": scr_lf.item(1, "Category Volume"),
            "SCR": scr_lf.item(1, "SCR")
        })

# Create final DataFrame
scr = pl.DataFrame(scr_results)
Code
(
    GT(scr, rowname_col='Brands')
    .tab_header(
        title='Share of Category Requirements (SCR)',
        subtitle='% of category volume that the brand represents among its buyers'
    )
    .tab_stubhead(label="Brands")
    .fmt_integer(columns=['Brand Purchasing', 'Category Purchasing'])
    .fmt_percent(columns='SCR', decimals=0)
    .data_color(
        domain=[0, 0.7],
        palette=['white', 'rebeccapurple'],
        columns='SCR'
    )
)
Share of Category Requirements (SCR)
% of category volume that the brand represents among its buyers

| Brands  | Brand Purchasing | Category Purchasing | SCR |
|---------|-----------------:|--------------------:|----:|
| Alpha   |            9,166 |              13,315 | 69% |
| Bravo   |            8,240 |              12,121 | 68% |
| Charlie |            2,171 |               4,780 | 45% |
| Delta   |              921 |               2,277 | 40% |
| Other   |              286 |                 980 | 29% |
Code
# Method 2
brand_vol = vol_pivot(grocery_lf, 1).fill_null(0)

ever_buyers = brand_vol.select(
    [pl.when(pl.col(col) > 0).then(1).otherwise(0).alias(col) for col in brand_vol.columns]
).to_numpy()

brand_vol = brand_vol.to_numpy()

cross_purchasing = ever_buyers.T @ brand_vol
brand_purchasing = np.diag(cross_purchasing)
category_purchasing = np.sum(cross_purchasing, axis=1)

scr = brand_purchasing / category_purchasing
scr
We have just seen that 69% of total category volume purchasing by the buyers of Alpha went to that brand. We know from the duplication of purchase analysis that 15% of Alpha buyers also purchased Charlie. How much of their category volume purchasing went to Charlie? This is answered via cross purchase analysis (sometimes called a combination purchase analysis).
To illustrate the logic of the associated calculations, let us revisit the toy problem introduced earlier. We have the following summary of the purchasing of four brands by six households and the associated “ever buyers” matrix. We will assume the brand purchasing table reports volume in kilograms.
Brand Purchasing

|      | A | B | C | D | Total |
|------|---|---|---|---|-------|
| HH01 | 1 | 0 | 2 | 0 | 3     |
| HH02 | 0 | 1 | 0 | 0 | 1     |
| HH03 | 1 | 3 | 0 | 0 | 4     |
| HH04 | 0 | 0 | 1 | 4 | 5     |
| HH05 | 1 | 1 | 0 | 1 | 3     |
| HH06 | 0 | 0 | 0 | 1 | 1     |
Ever Buyers Matrix (Incidence Matrix)

|      | A | B | C | D |
|------|---|---|---|---|
| HH01 | 1 | 0 | 1 | 0 |
| HH02 | 0 | 1 | 0 | 0 |
| HH03 | 1 | 1 | 0 | 0 |
| HH04 | 0 | 0 | 1 | 1 |
| HH05 | 1 | 1 | 0 | 1 |
| HH06 | 0 | 0 | 0 | 1 |
We see that buyers of brand A purchased 3 kg (1 + 1 + 1, from A column) of brand A and a total of 10 kg (3 + 4 + 3, from total column) in the category (i.e., SCR = 30%). We see that they also purchased 4 kg (3 + 1, from B column) of brand B, 2 kg (2, from C column) of brand C and 1 kg (1, from D column) of brand D. The associated numbers for all brands are given in the following table:
|   | A | B | C | D |
|---|---|---|---|---|
| A | 3 | 4 | 2 | 1 |
| B | 2 | 5 | 0 | 1 |
| C | 1 | 0 | 3 | 4 |
| D | 1 | 1 | 1 | 6 |
The sum of the elements of each row gives us the total amount of category purchasing by buyers of the brand of that row. Dividing each row entry by the sum of that row’s elements gives us the following cross purchasing table, the diagonal of which is obviously SCR.
|   | A   | B   | C   | D   |
|---|-----|-----|-----|-----|
| A | 30% | 40% | 20% | 10% |
| B | 25% | 63% | 0%  | 13% |
| C | 13% | 0%  | 38% | 50% |
| D | 11% | 11% | 11% | 67% |
How do we create the table efficiently when we have a large number of panellists? One approach is to pre-multiply the panellist × brand volume purchasing summary table \(B\) by the transpose of the “ever buyers” matrix \(A\), i.e., compute \(A^{T} B\):
Transposed “ever buyer” matrix:

|   | HH01 | HH02 | HH03 | HH04 | HH05 | HH06 |
|---|------|------|------|------|------|------|
| A | 1    | 0    | 1    | 0    | 1    | 0    |
| B | 0    | 1    | 1    | 0    | 1    | 0    |
| C | 1    | 0    | 0    | 1    | 0    | 0    |
| D | 0    | 0    | 0    | 1    | 1    | 1    |
Multiplied by Brand Volume Purchasing:

|      | A | B | C | D |
|------|---|---|---|---|
| HH01 | 1 | 0 | 2 | 0 |
| HH02 | 0 | 1 | 0 | 0 |
| HH03 | 1 | 3 | 0 | 0 |
| HH04 | 0 | 0 | 1 | 4 |
| HH05 | 1 | 1 | 0 | 1 |
| HH06 | 0 | 0 | 0 | 1 |
The “ever buyer” matrix encodes each panellist as a buyer (1) or non-buyer (0) of each brand. When we pre-multiply the volume purchasing matrix by the transpose of this binary matrix, each row of the result gives, for the buyers of one brand, their total volume purchasing of that brand and of every other brand. In other words, the product captures both the overlaps between brands’ buyer bases and how much those overlapping buyers purchased.
Code
# m x n matrix where m = # of panellists, n = # of brands and value is volume of purchases
brand_purch = np.array([
    [1, 0, 2, 0],
    [0, 1, 0, 0],
    [1, 3, 0, 0],
    [0, 0, 1, 4],
    [1, 1, 0, 1],
    [0, 0, 0, 1]
])

# m x n matrix where value is 1 if buyer ever purchased, 0 if never purchased
ever_buyers = np.where(brand_purch != 0, 1, 0)

# n x n square matrix representing row brand buyers and their volume/unit
# purchases from the row brand and column brands
cross_purch_matrix = ever_buyers.T @ brand_purch
print('Cross Purchasing Matrix:')
print(cross_purch_matrix, cross_purch_matrix.shape)
print()

# The sum of the elements of each row gives us the total amount of category
# purchasing (volume) by buyers of the brand of that row
category_purch = cross_purch_matrix.sum(axis=1).reshape(4, 1)
print('Total Amount of Category Purchasing by Buyers of the Row Brand:')
print(category_purch, category_purch.shape)
print()

# n x n square matrix representing the proportion of volume purchased by row
# brand buyers for each COLUMN brand relative to the total purchasing of the
# row brand's buyers
cross_purch = cross_purch_matrix / category_purch
print('Cross Purchasing Matrix as a % of Total Volume Purchased of Each Brand:')
print(cross_purch, cross_purch.shape)
Cross Purchasing Matrix:
[[3 4 2 1]
[2 5 0 1]
[1 0 3 4]
[1 1 1 6]] (4, 4)
Total Amount of Category Purchasing by Buyers of the Row Brand:
[[10]
 [ 8]
 [ 8]
 [ 9]] (4, 1)
Cross Purchasing Matrix as a % of Total Volume Purchased of Each Brand:
[[0.3 0.4 0.2 0.1 ]
[0.25 0.625 0. 0.125 ]
[0.125 0. 0.375 0.5 ]
[0.11111111 0.11111111 0.11111111 0.66666667]] (4, 4)
The cross purchasing analysis for year 1 is computed and reported below. We see that for those panellists that purchased Alpha at least once in year 1, 69% of their category volume purchased went to Alpha, 18% went to Bravo, 8% to Charlie, and so on.
Code
brand_vol_purchases = vol_pivot(grocery_lf, 1).fill_null(0)

# Step 1: Create ever buyers matrix
ever_buyers = brand_vol_purchases.select(
    [pl.when(pl.col(col) > 0).then(1).otherwise(0).alias(col) for col in brand_vol_purchases.columns]
).to_numpy()

# Step 2: Compute cross purchase matrix
cross_purch_matrix = ever_buyers.T @ brand_vol_purchases.to_numpy()
total_volume_purchases = np.sum(cross_purch_matrix, axis=1).reshape(5, 1)

# Step 3: Compute cross purchase matrix as a % of total volume purchased of each brand
cross_purch = cross_purch_matrix / total_volume_purchases

# DataFrame conversion for tabular view
cross_purch_matrix_df = (
    pl.from_numpy(cross_purch_matrix, schema=brand_vol_purchases.columns)
    .with_columns(pl.Series('Brands', brand_vol_purchases.columns))
)
cross_purch_df = (
    pl.from_numpy(cross_purch, schema=brand_vol_purchases.columns)
    .with_columns(pl.Series('Brands', brand_vol_purchases.columns))
)
Code
(
    GT(cross_purch_matrix_df, rowname_col='Brands')
    .tab_header(
        title='Cross Purchasing Analysis',
        subtitle='Total Volume of Category Purchasing by Buyers of the Brand (by row)'
    )
    .tab_stubhead(label="Brands")
    .sub_missing(missing_text="")
    .fmt_integer()
    .data_color(
        columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
        domain=[0, 3000],
        palette=['white', 'rebeccapurple'],
    )
    .tab_style(
        style=[
            style.text(color="white"),
            style.fill('lightgrey'),
        ],
        locations=[
            loc.body(0, 0),
            loc.body(1, 1),
            loc.body(2, 2),
            loc.body(3, 3),
            loc.body(4, 4)
        ]
    )
    .tab_source_note(md("""**Read Row/Column as**: _Purchasers of [Brands] / Total Volume Purchased of [Brands]_"""))
)
Cross Purchasing Analysis
Total Volume of Category Purchasing by Buyers of the Brand (by row)

| Brands  | Alpha | Bravo | Charlie | Delta | Other |
|---------|------:|------:|--------:|------:|------:|
| Alpha   | 9,166 | 2,461 |   1,076 |   504 |   108 |
| Bravo   | 2,522 | 8,240 |     907 |   282 |   170 |
| Charlie | 1,221 | 1,131 |   2,171 |   220 |    36 |
| Delta   |   701 |   311 |     332 |   921 |    13 |
| Other   |   223 |   398 |      53 |    20 |   286 |

Read Row/Column as: Purchasers of [Brands] / Total Volume Purchased of [Brands]
Code
(
    GT(cross_purch_df, rowname_col='Brands')
    .tab_header(
        title='Cross Purchasing Analysis',
        subtitle='% of Total Volume of Category Purchasing by Buyers of the Brand (by row)'
    )
    .tab_stubhead(label="Brands")
    .sub_missing(missing_text="")
    .fmt_percent(decimals=0)
    .data_color(
        columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
        domain=[0, 0.7],
        palette=['white', 'rebeccapurple'],
    )
    .tab_source_note(md("""**Read Row/Column as**: _Purchasers of [Brands] / % Total Volume Purchased of [Brands]_"""))
)
Cross Purchasing Analysis
% of Total Volume of Category Purchasing by Buyers of the Brand (by row)

| Brands  | Alpha | Bravo | Charlie | Delta | Other |
|---------|------:|------:|--------:|------:|------:|
| Alpha   |   69% |   18% |      8% |    4% |    1% |
| Bravo   |   21% |   68% |      7% |    2% |    1% |
| Charlie |   26% |   24% |     45% |    5% |    1% |
| Delta   |   31% |   14% |     15% |   40% |    1% |
| Other   |   23% |   41% |      5% |    2% |   29% |

Read Row/Column as: Purchasers of [Brands] / % Total Volume Purchased of [Brands]
5.4.2 Importance of Competition Plot - By Volume Purchase
For a given brand, we can plot the associated row entries as a pie chart. The importance of competition plot below does this for Alpha, showing the percentage of total volume of category purchasing by buyers of Alpha that went to each brand.
Code
cross_purch_plot = (
    pl.from_numpy(cross_purch.T, schema=brand_vol_purchases.columns, orient='row')
    .with_columns(pl.Series('Brands', brand_vol_purchases.columns))
)

base = alt.Chart(cross_purch_plot).encode(
    alt.Theta("Alpha:Q").stack(True),
    alt.Color("Brands:N")
).properties(
    width=600,
    title={
        'text': 'Importance of Competition to Buyers of Alpha',
        'subtitle': '% of Total Volume of Category Purchasing by Buyers of Alpha'
    },
)

pie = base.mark_arc(outerRadius=145)
text = base.mark_text(radius=165, size=15).encode(
    text=alt.Text("Alpha:Q", format=".0%")
)

pie + text
5.4.3 Importance Against Expectation - By Volume Purchase
As previously noted when we computed SCR, Alpha accounts for 69% of category purchasing by the buyers of Alpha. We see from the cross purchase analysis that Bravo accounts for 18% of their category purchasing. Is this large or small? One way of answering this question is to compare actual purchasing against expectation given general purchasing patterns in the category.
Code
# Share of category purchasing - Alpha brand selected
share_of_cat_purch = cross_purch[0]

# Market share - volume
market_share = (
    brand_vol_purchases.sum() /                 # sum all columns -> returns a single-row dataframe
    brand_vol_purchases.sum_horizontal().sum()  # total category volume sales -> returns a scalar value
).to_numpy().reshape(5)

# Share of residual purchasing -- Alpha buyers
# The percentage of the category purchasing not accounted for by Alpha that
# goes to each of the other brands
residual_purch_brand = share_of_cat_purch / (1 - share_of_cat_purch[0])

# Share of residual purchasing -- category buyers
# When we exclude Alpha, the percentage of the remaining category purchasing
# (across all category buyers) that goes to each of the other brands
residual_purch_category = market_share / (1 - market_share[0])

# Index against expectation
index_against_expect = 100 * residual_purch_brand / residual_purch_category

# Importance against expectation - create DataFrame
stack = np.vstack((
    share_of_cat_purch,
    market_share,
    residual_purch_brand,
    residual_purch_category,
    index_against_expect
))
cols = pl.DataFrame({'Col': [
    'Share of Category Purchasing',
    'Volume Market Share',
    'Share of Residual Purchasing - Alpha Buyers',
    'Share of Residual Purchasing - Category Buyers',
    'Index Against Expectation'
]})
rows = cross_purch_df.columns[:-1]
importance_against_expect = pl.from_numpy(stack, schema=rows).hstack(columns=cols)
Code
(
    GT(importance_against_expect, rowname_col='Col')
    .tab_header(
        title='Importance Against Expectation',
        subtitle='Compare Actual Purchasing against Expectation Given General Purchasing Patterns'
    )
    .fmt_percent(rows=list(range(4)), decimals=0)
    .fmt_integer(rows=4)
    .data_color(
        columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
        rows=4,
        domain=[80, 170],
        palette=['white', 'rebeccapurple'],
    )
    .tab_style(
        style=[
            style.text(color="white"),
            style.fill('white'),
        ],
        locations=[loc.body(0, [2, 3, 4])]
    )
)
Importance Against Expectation
Compare Actual Purchasing against Expectation Given General Purchasing Patterns

|                                                | Alpha | Bravo | Charlie | Delta | Other |
|------------------------------------------------|------:|------:|--------:|------:|------:|
| Share of Category Purchasing                   |   69% |   18% |      8% |    4% |    1% |
| Volume Market Share                            |   44% |   40% |     10% |    4% |    1% |
| Share of Residual Purchasing - Alpha Buyers    |  221% |   59% |     26% |   12% |    3% |
| Share of Residual Purchasing - Category Buyers |   79% |   71% |     19% |    8% |    2% |
| Index Against Expectation                      |   280 |    84 |     139 |   153 |   105 |
Code
index_ae = pl.DataFrame({
    'Index Against Expectation - By Volume Purchase': index_against_expect[1:],
    'Brands': cross_purch_df.columns[1:-1]
})

# Create the Altair chart
alt.Chart(index_ae).mark_bar().encode(
    x=alt.X("Index Against Expectation - By Volume Purchase:Q"),
    y=alt.Y("Brands:N"),
).properties(
    width=650,
    height=250,
    title={"text": 'Importance Against Expectation'},
)
Let us consider Charlie. With reference to Charlie’s residual share of category purchasing not accounted for by Alpha, we see that it accounted for 26% of the category purchasing by Alpha buyers that did not go to Alpha. If the purchasing of Alpha buyers was consistent with overall market patterns (as reflected in the volume market shares), we would expect Charlie to account for 19% of their purchasing. We can therefore say that Charlie’s share of purchasing amongst the buyers of Alpha is above expectation (when expectation is based on overall patterns of buying behaviour).
The index against expectation captures the extent to which the purchasing of Alpha’s buyers went to each competitor above or below what we would expect from overall market patterns.
We see that, relative to market share, Bravo is less of a threat to Alpha than we would expect (index = 84). Relative to market share, Charlie and Delta are purchased more by buyers of Alpha than we would expect.
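The arithmetic behind the index can be sketched with the (rounded) volume-based percentages from the tables above, taking Charlie as the competitor of interest. Because the inputs here are rounded to whole percentages, the result differs slightly from the exact index of 139 reported in the table:

```python
# Index against expectation for Charlie among Alpha buyers, using the rounded
# volume-based percentages from the tables above.
charlie_share_among_alpha_buyers = 0.08  # Charlie's share of Alpha buyers' category volume
alpha_scr = 0.69                         # Alpha's own share among its buyers
charlie_market_share = 0.10              # Charlie's volume market share
alpha_market_share = 0.44                # Alpha's volume market share

# Actual: Charlie's share of the purchasing that did NOT go to Alpha
residual_actual = charlie_share_among_alpha_buyers / (1 - alpha_scr)
# Expected: Charlie's share of the market once Alpha is removed
residual_expected = charlie_market_share / (1 - alpha_market_share)

# Index > 100 means Charlie is more important to Alpha buyers than expected
index = 100 * residual_actual / residual_expected
print(round(residual_actual, 2), round(residual_expected, 2), round(index))
```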
5.4.4 Cross Purchasing - By Spend
We can repeat these analyses using spend rather than volume purchasing.
We see, for example, that buyers of Alpha spent 71% of their category spend on Alpha. This is in contrast to the 69% of their category volume requirements satisfied by Alpha.
Code
brand_spend_purchases = spend_pivot(grocery_lf, 1).fill_null(0)

# Step 1: Create ever buyers matrix
ever_buyers = brand_spend_purchases.select(
    [pl.when(pl.col(col) > 0).then(1).otherwise(0).alias(col) for col in brand_spend_purchases.columns]
).to_numpy()

# Step 2: Compute cross purchase matrix
cross_purch_matrix = ever_buyers.T @ brand_spend_purchases.to_numpy()
total_spend_purchases = np.sum(cross_purch_matrix, axis=1).reshape(5, 1)

# Step 3: Compute cross purchase matrix as a % of total spend of each brand
cross_purch = cross_purch_matrix / total_spend_purchases

# DataFrame conversion for tabular view
cross_purch_matrix_df = (
    pl.from_numpy(cross_purch_matrix, schema=brand_spend_purchases.columns)
    .with_columns(pl.Series('Brands', brand_spend_purchases.columns))
)
cross_purch_df = (
    pl.from_numpy(cross_purch, schema=brand_spend_purchases.columns)
    .with_columns(pl.Series('Brands', brand_spend_purchases.columns))
)
Code
(
    GT(cross_purch_matrix_df, rowname_col='Brands')
    .tab_header(
        title='Cross Purchasing Analysis',
        subtitle='Total Category Spend by Buyers of the Brand (by row)'
    )
    .tab_stubhead(label="Brands")
    .sub_missing(missing_text="")
    .fmt_integer()
    .data_color(
        columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
        domain=[0, 10_000],
        palette=['white', 'rebeccapurple'],
    )
    .tab_style(
        style=[
            style.text(color="white"),
            style.fill('lightgrey'),
        ],
        locations=[
            loc.body(0, 0),
            loc.body(1, 1),
            loc.body(2, 2),
            loc.body(3, 3),
            loc.body(4, 4)
        ]
    )
    .tab_source_note(md("""**Read Row/Column as**: _Purchasers of [Brands] / Total Spend on [Brands]_"""))
)
Cross Purchasing Analysis
Total Category Spend by Buyers of the Brand (by row)

| Brands  |  Alpha |  Bravo | Charlie | Delta | Other |
|---------|-------:|-------:|--------:|------:|------:|
| Alpha   | 33,571 |  8,913 |   2,606 | 1,807 |   571 |
| Bravo   |  9,292 | 28,603 |   2,076 | 1,008 | 1,022 |
| Charlie |  4,481 |  3,854 |   5,121 |   793 |   129 |
| Delta   |  2,597 |  1,149 |     806 | 3,272 |    57 |
| Other   |    825 |  1,337 |     118 |    74 | 1,535 |

Read Row/Column as: Purchasers of [Brands] / Total Spend on [Brands]
Code
(
    GT(cross_purch_df, rowname_col='Brands')
    .tab_header(
        title='Cross Purchasing Analysis',
        subtitle='% of Total Category Spend by Buyers of the Brand (by row)'
    )
    .tab_stubhead(label="Brands")
    .sub_missing(missing_text="")
    .fmt_percent(decimals=0)
    .data_color(
        columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
        domain=[0, 0.8],
        palette=['white', 'rebeccapurple'],
    )
    .tab_source_note(md("""**Read Row/Column as**: _Purchasers of [Brands] / % Total Spend on [Brands]_"""))
)
Cross Purchasing Analysis
% of Total Category Spend by Buyers of the Brand (by row)

| Brands  | Alpha | Bravo | Charlie | Delta | Other |
|---------|------:|------:|--------:|------:|------:|
| Alpha   |   71% |   19% |      5% |    4% |    1% |
| Bravo   |   22% |   68% |      5% |    2% |    2% |
| Charlie |   31% |   27% |     36% |    6% |    1% |
| Delta   |   33% |   15% |     10% |   42% |    1% |
| Other   |   21% |   34% |      3% |    2% |   39% |

Read Row/Column as: Purchasers of [Brands] / % Total Spend on [Brands]
5.4.5 Importance Against Expectation - By Spend
We can create a spend-based importance against expectation plot, using value market share as the reference.
Code
# Share of category purchasing - Alpha brand selected
share_of_cat_purch = cross_purch[0]

# Market share - value (spend)
market_share = (
    brand_spend_purchases.sum() /                 # sum all columns -> returns a single-row dataframe
    brand_spend_purchases.sum_horizontal().sum()  # total category spend -> returns a scalar value
).to_numpy().reshape(5)

# Share of residual purchasing -- Alpha buyers
# The percentage of the category spend not accounted for by Alpha that goes
# to each of the other brands
residual_purch_brand = share_of_cat_purch / (1 - share_of_cat_purch[0])

# Share of residual purchasing -- category buyers
# When we exclude Alpha, the percentage of the remaining category spend
# (across all category buyers) that goes to each of the other brands
residual_purch_category = market_share / (1 - market_share[0])

# Index against expectation
index_against_expect = 100 * residual_purch_brand / residual_purch_category

# Importance against expectation - create DataFrame
stack = np.vstack((
    share_of_cat_purch,
    market_share,
    residual_purch_brand,
    residual_purch_category,
    index_against_expect
))
cols = pl.DataFrame({'Col': [
    'Share of Category Purchasing',
    'Value Market Share',
    'Share of Residual Purchasing - Alpha Buyers',
    'Share of Residual Purchasing - Category Buyers',
    'Index Against Expectation'
]})
rows = cross_purch_df.columns[:-1]
importance_against_expect = pl.from_numpy(stack, schema=rows).hstack(columns=cols)
Code
(
    GT(importance_against_expect, rowname_col='Col')
    .tab_header(
        title='Importance Against Expectation',
        subtitle='Compare Actual Purchasing against Expectation Given General Purchasing Patterns'
    )
    .fmt_percent(rows=list(range(4)), decimals=0)
    .fmt_integer(rows=4)
    .data_color(
        columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
        rows=4,
        domain=[80, 170],
        palette=['white', 'rebeccapurple'],
    )
    .tab_style(
        style=[
            style.text(color="white"),
            style.fill('white'),
        ],
        locations=[loc.body(0, [2, 3, 4])]
    )
)
Importance Against Expectation
Compare Actual Purchasing against Expectation Given General Purchasing Patterns

|                                                | Alpha | Bravo | Charlie | Delta | Other |
|------------------------------------------------|------:|------:|--------:|------:|------:|
| Share of Category Purchasing                   |   71% |   19% |      5% |    4% |    1% |
| Value Market Share                             |   47% |   40% |      7% |    5% |    2% |
| Share of Residual Purchasing - Alpha Buyers    |  242% |   64% |     19% |   13% |    4% |
| Share of Residual Purchasing - Category Buyers |   87% |   74% |     13% |    8% |    4% |
| Index Against Expectation                      |   277 |    86 |     141 |   153 |   103 |
Code
index_ae = pl.DataFrame({
    'Index Against Expectation - By Spend': index_against_expect[1:],
    'Brands': cross_purch_df.columns[1:-1]
})

# Create the Altair chart
alt.Chart(index_ae).mark_bar().encode(
    x=alt.X("Index Against Expectation - By Spend:Q"),
    y=alt.Y("Brands:N"),
).properties(
    width=650,
    height=250,
    title={"text": 'Importance Against Expectation'}
)
Garbage Collect
Code
exceptions = ['grocery_lf', 'sku_lf', 'kiwi_lf', 'In', 'exceptions', 'active_variables',
              'penetration', 'ppb', 'panel_size']

active_variables = [
    var for var, value in globals().items()
    if not var.startswith('_')   # Exclude variables that start with "_"
    and var not in exceptions    # Exclude variables in the exceptions list
    and isinstance(value, (pl.LazyFrame, pl.DataFrame, pl.Series, alt.Chart, alt.LayerChart,
                           list, int, float, str, np.ndarray, np.int64, np.float32))  # Remove these types only
]

for var in active_variables:
    del globals()[var]

del active_variables, exceptions, var
gc.collect()
387
6 Exploring Dynamics in Buyer Behaviour
Up to now, we have been characterizing buyer behavior in a given time period (i.e., one year), be it focusing on one brand or multiple brands. We now consider some standard analyses that give insight into the dynamics of buyer behavior from period to period. We first consider the case of established products and then turn our attention to the analysis of new product buying behavior.
6.1 Established Products
We are interested in summarizing how buyer behavior varies across consecutive periods. We first consider how temporal variations in total sales can be understood by decomposing total sales. Next we explore temporal variations in customer-level purchasing by examining how the distribution of purchasing in one period varies as a function of the level of purchasing in the previous period. Finally, we consider a summary measure of period-to-period purchasing called the repeat rate.
6.1.1 Understanding Temporal Variations in Sales
Most firms have systems that will report sales over time. As we try to make sense of any observed changes, it is helpful to note a fundamental (multiplicative) sales decomposition. For any time period,
\[
\begin{aligned}
\text{Sales} = {} & \text{# households (HHs) in the country} \\
& \times \text{proportion of HHs buying the brand (penetration)} \\
& \times \text{# purchase occasions per buyer (PPB)} \\
& \times \text{# packs per purchase} \\
& \times \text{weight or price per pack}
\end{aligned}
\]
There is nothing magical about this specific decomposition. We can create variations on a theme that are more relevant for the specific analysis setting at hand. For example, suppose we are doing an analysis at the brand level, where the SKUs associated with the brand come in different sizes. Furthermore, suppose the time period is sufficiently small that households make only one purchase per period, if at all. (In other words, PPB = 1). A more relevant decomposition would be
\[
\begin{aligned}
\text{Sales (\$)} = {} & \text{# households (HHs) in the country} \\
& \times \text{proportion of HHs buying the brand (penetration)} \\
& \times \text{average volume per purchase} \\
& \times \text{average price per unit of volume}
\end{aligned}
\]
The product of the last two quantities is often called average order value (AOV).
\[
AOV = \text{average volume per purchase} \times \text{average price per unit of volume}
\]
Recall the plot of Alpha’s revenue we created earlier:
Code
weekly_plot(
    dataframe=weekly_spend_summary('Alpha', grocery_lf).collect(),
    y='Weekly Spend',
    title='Plot of weekly revenue for Alpha',
    y_axis_label='Spend ($)',
    pct=False,
    legend=False
)
We observe some weeks where there is a massive increase in revenue. How much of this is due to an increase in penetration versus, say, buyers simply buying more product on a given purchase occasion? Let us explore this using the second decomposition given above.
We start by computing Alpha’s weekly penetration numbers:
Check whether purchase occasions per buyer (PPB) = 1.
We do this by grouping the panel data by ‘week’ and aggregating to get the number of unique ‘trans_id’ and ‘panel_id’ per week. If ‘num_trans’ and ‘num_buyers’ are the same for each week, then no panellist made more than one purchase in any given week, which means PPB = 1.
We produce a new dataframe that is grouped by 104 weeks and aggregates the count of unique number of panellists (weekly penetration) and number of transaction IDs each week.
Output: The number of panellists that made at least one purchase of Alpha; the total number of category purchase occasions on which Alpha was purchased, for each week.
The next step is to add the weekly volume and revenue numbers:
Join SKU weights data with panel data, group panel data by week, compute volume and spend.
Output: Alpha’s (dollar and volume) sales.
Next we compute the numbers associated with the revenue decomposition:
Weekly penetration is simply the number of panellists active in any given week divided by the size of the panel.
Average order value is simply total revenue for any given week divided by the number of panellists that made at least one purchase of Alpha in that week.
As noted above, the AOV quantity can be decomposed into average order volume and average price per unit volume (in this case, kg).
Average order volume is simply (total) volume sold in any given week divided by the number of panellists that made at least one purchase of Alpha in that week.
Average price per unit volume is simply total revenue for any given week divided by (total) volume sold in that week.
We note that the number of transactions (num_trans) associated with each week equals the number of panellists (num_buyers) associated with each week. In other words, no panellist made more than one purchase in any given week, which means PPB = 1 (num_trans / num_buyers), as assumed above.
Code
# Filter for rows where the brand is "Alpha"
alpha_lf = grocery_lf.filter(pl.col("brand") == "Alpha")

# Aggregate to get the number of unique trans_id and panel_id per week
alpha_weekly_trans = (
    alpha_lf
    .group_by("week")
    .agg(
        pl.col("trans_id").n_unique().alias("num_trans"),
        pl.col("panel_id").n_unique().alias("num_buyers")
    )
)

# Aggregate to get the sum of spend and volume per week
alpha_weekly_spend_vol = (
    alpha_lf
    .with_columns((pl.col('units') * pl.col('price')).alias('spend'))
    .join(other=sku_lf, on='sku_id')
    .with_columns(((pl.col('units') * pl.col('weight')) / 1000).alias('volume'))
    .group_by('week')
    .agg(
        pl.col("spend").sum().alias("spend"),
        pl.col("volume").sum().alias("volume")
    )
)

# Join alpha_weekly_trans & alpha_weekly_spend_vol.
# alpha_weekly_summary now contains week, num_trans, num_buyers, spend, and volume
alpha_weekly_summary = alpha_weekly_trans.join(alpha_weekly_spend_vol, on="week").sort('week')

alpha_weekly = (
    alpha_weekly_summary
    .with_columns(
        # Weekly penetration = num_buyers / panel_size
        (pl.col('num_buyers') / panel_size).alias('penet'),
        # PPB = num_trans / num_buyers
        (pl.col('num_trans') / pl.col('num_buyers')).alias('ppb'),
        # Average order value = spend / num_trans
        (pl.col('spend') / pl.col('num_trans')).alias('aov'),
        # Average order volume = volume / num_trans
        (pl.col('volume') / pl.col('num_trans')).alias('aovol'),
        # Average price per unit volume = spend / volume
        (pl.col('spend') / pl.col('volume')).alias('avg_price_kg')
    )
    .select('week', 'spend', 'penet', 'aov', 'aovol', 'avg_price_kg')
)
Now we compute the correlations between weekly revenue and the components of its (multiplicative) decomposition across the two years.
Correlations between Weekly Revenue & its Components - Over 2 Years
Code
corr_matrix = alpha_weekly.select(pl.all().exclude('week')).collect().corr()
relabel = pl.Series(['Revenue', 'Penetration', 'AOV', 'AOVOL', 'Price/Unit'])
corr_matrix = corr_matrix.with_columns(pl.Series(relabel).alias("Decomposition"))

(
    GT(corr_matrix, rowname_col='Decomposition')
    .tab_header(
        title='Correlation Matrix of Revenue and the Components of its Decomposition',
        subtitle='Correlations between weekly revenue and the components of its (multiplicative) decomposition'
    )
    .cols_label(
        spend=relabel[0],
        penet=relabel[1],
        aov=relabel[2],
        aovol=relabel[3],
        avg_price_kg=relabel[4]
    )
    .cols_align(align='center')
    .fmt_number(decimals=3)
    .data_color(
        columns=['spend', 'penet', 'aov', 'aovol', 'avg_price_kg'],
        domain=[-1, 1],
        palette=['orange', 'white', 'rebeccapurple'],
    )
    .tab_style(
        style=[
            style.text(color="white"),
            style.fill('lightgrey'),
        ],
        locations=[
            loc.body(0, 0),
            loc.body(1, 1),
            loc.body(2, 2),
            loc.body(3, 3),
            loc.body(4, 4)
        ]
    )
)
Correlation Matrix of Revenue and the Components of its Decomposition
Correlations between weekly revenue and the components of its (multiplicative) decomposition

|             | Revenue | Penetration |    AOV |  AOVOL | Price/Unit |
|-------------|--------:|------------:|-------:|-------:|-----------:|
| Revenue     |   1.000 |       0.982 |  0.542 |  0.758 |     −0.590 |
| Penetration |   0.982 |       1.000 |  0.387 |  0.703 |     −0.654 |
| AOV         |   0.542 |       0.387 |  1.000 |  0.691 |     −0.073 |
| AOVOL       |   0.758 |       0.703 |  0.691 |  1.000 |     −0.754 |
| Price/Unit  |  −0.590 |      −0.654 | −0.073 | −0.754 |      1.000 |
Below is a plot of the penetration numbers. Comparing this with Alpha’s weekly revenue plot over the two years, we see that the fluctuations in revenue go hand-in-hand with fluctuations in penetration. To make this comparison clearer, we overlay the two series of numbers as well. The correlation between these two quantities is 0.98. It would appear that the key driver of revenue increases is simply more people buying the brand that week.
Code
# Method 1: Fold transformation - convert wide-form data to long-form data directly without any preprocessing
source = alpha_weekly.select('week', 'spend', 'penet').collect()

base = alt.Chart(source).mark_line().transform_fold(
    ['Penetration', 'Revenue'],
    as_=['Legend', 'Value']
).encode(
    alt.Color('Legend:N'),
    alt.X('week', axis=alt.Axis(
        values=np.arange(0, 104 + 1, 13),  # Explicitly specify quarter-end weeks
        labelExpr="datum.value",           # Show only these labels
        title='Week'
    ))
)

spend_line = base.transform_filter(
    alt.datum.Legend == 'Revenue'
).encode(
    y=alt.Y('spend:Q',  # Q = a continuous real-valued quantity
            axis=alt.Axis(format="$,.0f")).title('Revenue ($)')
)

penet_line = base.transform_filter(
    alt.datum.Legend == 'Penetration'
).encode(
    y=alt.Y('penet:Q',  # Q = a continuous real-valued quantity
            axis=alt.Axis(format=",.0%")).title('Penetration')
)

alt.layer(spend_line, penet_line).resolve_scale(
    y='independent'
).properties(
    width=600,
    height=250,
    title='Weekly Revenue & Penetration for Alpha'
).configure_view(stroke=None).configure_axisY(grid=True)
Code
# Method 2: Pre-processed - wide-form to long-form conversion
source = alpha_weekly.select('week', 'spend', 'penet').unpivot(
    index='week',
    on=['spend', 'penet']
).collect()

base = alt.Chart(source).mark_line().encode(
    alt.Color('variable:N'),
    alt.X('week', axis=alt.Axis(
        values=np.arange(0, 104 + 1, 13),  # Explicitly specify quarter-end weeks
        labelExpr="datum.value",           # Show only these labels
        title='Week'
    ))
)

spend_line = base.transform_filter(
    alt.datum.variable == 'spend'
).encode(
    y=alt.Y('value:Q',  # Q = a continuous real-valued quantity
            axis=alt.Axis(format="$,.0f")).title('Revenue ($)')
)

penet_line = base.transform_filter(
    alt.datum.variable == 'penet'
).encode(
    y=alt.Y('value:Q',  # Q = a continuous real-valued quantity
            axis=alt.Axis(format=",.0%")).title('Penetration')
)

alt.layer(spend_line, penet_line).resolve_scale(
    y='independent'
).properties(
    width=600,
    height=250,
    title='Weekly Revenue & Penetration for Alpha'
).configure_view(stroke=None).configure_axisY(grid=True)
However, the correlation is not perfect, which means there is some variability in average order value that is not highly correlated with penetration. We plot this quantity below. (The correlation between penetration and average order value is 0.39.) To get a sense of what lies behind the variability in average order value, we also plot weekly average order volume and average price per kg.
Code
def weekly_plot(dataframe, y, year=2, title="", y_axis_label="", fmt='currency'):
    if fmt == 'currency':
        format = "$,.1f"
    elif fmt == 'percent':
        format = ",.0%"
    else:
        format = ",.1f"
    chart = alt.Chart(dataframe).mark_line(strokeWidth=1).encode(
        x=alt.X('week',
                axis=alt.Axis(
                    values=np.arange(0, (year * 52) + 1, 13),  # Explicitly specify quarter-end weeks
                    labelExpr="datum.value",                   # Show only these labels
                    title='Week')),
        y=alt.Y(f'{y}:Q',  # Q = a continuous real-valued quantity
                title=y_axis_label,
                axis=alt.Axis(format=format))
    ).properties(
        width=650, height=250, title=title
    ).configure_view(stroke=None)
    return chart
Code
weekly_plot(dataframe=alpha_weekly.select('week', 'aov').collect(),
            y='aov',
            title='Alpha - Weekly Average Order Value',
            y_axis_label='Average Order Value ($)',
            fmt='currency')
Code
weekly_plot(dataframe=alpha_weekly.select('week', 'aovol').collect(),
            y='aovol',
            title='Alpha - Weekly Average Order Volume (kg)',
            y_axis_label='Average Order Volume (kg)',
            fmt='number')
Looking at the weekly average price per kg plot, there was much less variability in price/kg in weeks 1–52 compared to weeks 53–104. It would appear that there was some change in promotion policy between years 1 and 2. We do not have the data to explore this further, however.
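Average order value is simply average order volume times average price per kg, so its variability must come from one or both of these components. A quick numeric sanity check of that identity, using toy totals (illustrative only, not values from the panel dataset):

```python
# Average order value decomposes multiplicatively: $/order = (kg/order) x ($/kg).
# Toy totals for illustration only -- these are NOT values from the panel dataset.
spend, volume, num_trans = 500.0, 125.0, 50

aov = spend / num_trans        # average order value ($ per order)
aovol = volume / num_trans     # average order volume (kg per order)
price_per_kg = spend / volume  # average price ($ per kg)

# The identity AOV = AOVOL x price/kg holds by construction
assert abs(aov - aovol * price_per_kg) < 1e-9
```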
Correlations between Weekly Revenue & its Components - Year 1
Code
corr_matrix = alpha_weekly.filter(pl.col('week') <= 52).select(pl.all().exclude('week')).collect().corr()
relabel = pl.Series(['Revenue', 'Penetration', 'AOV', 'AOVOL', 'Price/Unit'])
corr_matrix = corr_matrix.with_columns(pl.Series(relabel).alias("Decomposition"))

(
    GT(corr_matrix, rowname_col='Decomposition')
    .tab_header(title='Correlation Matrix of Revenue and the Components of its Decomposition',
                subtitle='Correlations between weekly revenue and the components of its (multiplicative) decomposition')
    .cols_label(
        spend=relabel[0],
        penet=relabel[1],
        aov=relabel[2],
        aovol=relabel[3],
        avg_price_kg=relabel[4]
    )
    .cols_align(align='center')
    .fmt_number(decimals=3)
    .data_color(
        columns=['spend', 'penet', 'aov', 'aovol', 'avg_price_kg'],
        domain=[-1, 1],
        palette=['orange', 'white', 'rebeccapurple'],
    )
    .tab_style(
        style=[
            style.text(color="white"),
            style.fill('lightgrey'),
        ],
        locations=[
            loc.body(0, 0), loc.body(1, 1), loc.body(2, 2),
            loc.body(3, 3), loc.body(4, 4)
        ]
    )
)
Correlation Matrix of Revenue and the Components of its Decomposition
Correlations between weekly revenue and the components of its (multiplicative) decomposition

|             | Revenue | Penetration |    AOV |  AOVOL | Price/Unit |
|-------------|--------:|------------:|-------:|-------:|-----------:|
| Revenue     |   1.000 |       0.990 |  0.788 |  0.840 |     −0.534 |
| Penetration |   0.990 |       1.000 |  0.707 |  0.778 |     −0.553 |
| AOV         |   0.788 |       0.707 |  1.000 |  0.956 |     −0.349 |
| AOVOL       |   0.840 |       0.778 |  0.956 |  1.000 |     −0.606 |
| Price/Unit  |  −0.534 |      −0.553 | −0.349 | −0.606 |      1.000 |
Correlations between Weekly Revenue & its Components - Year 2
Code
corr_matrix = alpha_weekly.filter(pl.col('week') >= 53).select(pl.all().exclude('week')).collect().corr()
relabel = pl.Series(['Revenue', 'Penetration', 'AOV', 'AOVOL', 'Price/Unit'])
corr_matrix = corr_matrix.with_columns(pl.Series(relabel).alias("Decomposition"))

(
    GT(corr_matrix, rowname_col='Decomposition')
    .tab_header(title='Correlation Matrix of Revenue and the Components of its Decomposition',
                subtitle='Correlations between weekly revenue and the components of its (multiplicative) decomposition')
    .cols_label(
        spend=relabel[0],
        penet=relabel[1],
        aov=relabel[2],
        aovol=relabel[3],
        avg_price_kg=relabel[4]
    )
    .cols_align(align='center')
    .fmt_number(decimals=3)
    .data_color(
        columns=['spend', 'penet', 'aov', 'aovol', 'avg_price_kg'],
        domain=[-1, 1],
        palette=['orange', 'white', 'rebeccapurple'],
    )
    .tab_style(
        style=[
            style.text(color="white"),
            style.fill('lightgrey'),
        ],
        locations=[
            loc.body(0, 0), loc.body(1, 1), loc.body(2, 2),
            loc.body(3, 3), loc.body(4, 4)
        ]
    )
)
Correlation Matrix of Revenue and the Components of its Decomposition
Correlations between weekly revenue and the components of its (multiplicative) decomposition

|             | Revenue | Penetration |    AOV |  AOVOL | Price/Unit |
|-------------|--------:|------------:|-------:|-------:|-----------:|
| Revenue     |   1.000 |       0.978 |  0.414 |  0.770 |     −0.685 |
| Penetration |   0.978 |       1.000 |  0.227 |  0.710 |     −0.748 |
| AOV         |   0.414 |       0.227 |  1.000 |  0.602 |     −0.035 |
| AOVOL       |   0.770 |       0.710 |  0.602 |  1.000 |     −0.804 |
| Price/Unit  |  −0.685 |      −0.748 | −0.035 | −0.804 |      1.000 |
Annual Revenue Decomposition
This same logic can be used to analyse change across longer time periods. For example, we note from our analysis earlier that Alpha’s revenue increased from $33,571 in year 1 to $35,251 in year 2, a 5% increase. As we are looking at a static sample (i.e., the same group of panellists over the two years), the increase cannot be due to population growth. Are we observing this growth because more households are buying the product (increase in buyers), and/or because those that are buying it are buying it more often (increase in frequency), and/or because the average spend per transaction has increased (increase in AOV)?
In order to explore this, we will use the following (multiplicative) sales decomposition:
\[
\begin{aligned}
\text{Annual Revenue} = {} & \text{\# households (HHs) in the panel} \\
& \times \text{proportion of HHs buying the brand (penetration)} \\
& \times \text{\# purchase occasions per buyer (PPB)} \\
& \times \text{average order volume} \\
& \times \text{average price per unit of volume}
\end{aligned}
\]
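The decomposition is an accounting identity: the chain of ratios telescopes back to total revenue. A minimal numeric check, using hypothetical figures (not the actual panel values):

```python
# Numeric check of the multiplicative decomposition with hypothetical figures
# (illustrative only; these are NOT the actual panel values).
panel_size = 5000      # households in the panel
num_buyers = 1500      # households buying the brand at least once
num_trans = 3000       # purchase occasions involving the brand
volume = 900.0         # total kg purchased
revenue = 33750.0      # total dollar sales

penet = num_buyers / panel_size   # penetration
ppb = num_trans / num_buyers      # purchase occasions per buyer
aovol = volume / num_trans        # average order volume (kg)
price_kg = revenue / volume       # average price per kg

# The chain of ratios telescopes back to total revenue
reconstructed = panel_size * penet * ppb * aovol * price_kg
assert abs(reconstructed - revenue) < 1e-6
```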
In order to perform a similar decomposition of annual revenue, we first need to create a dataset that summarises, for each year, (1) the number of panellists that made at least one purchase of Alpha, (2) the total number of category purchase occasions on which Alpha was purchased, and (3) Alpha’s (dollar and volume) sales. We use the same logic as above, aggregating by year as opposed to week.
We proceed as follows:

- compute the number of panellists that made at least one transaction;
- compute the total number of category purchase occasions (number of transactions) on which Alpha was purchased;
- compute total revenue and volume for Alpha by year;
- compute the components of the revenue decomposition:
  - Penetration is the number of households that made at least one purchase of Alpha in the year divided by the size of the panel.
  - PPB is the total number of transactions associated with Alpha divided by the number of households that made at least one purchase of Alpha in the year.
  - Average order value is annual revenue divided by the total number of transactions associated with Alpha.
  - Average order volume is annual total volume divided by the total number of transactions associated with Alpha.
  - Average price/kg is revenue divided by total volume.
- compute the percentage changes.
Code
alpha_yearly_summary = (
    grocery_lf
    .filter(pl.col('brand') == 'Alpha')
    .join(other=sku_lf, on='sku_id')
    .with_columns(
        (pl.col('units') * pl.col('price')).cast(pl.Float64).alias('spend'),
        (pl.col('units') * pl.col('weight') / 1000).alias('volume'),
        (pl.col('week') / 52).ceil().cast(pl.UInt16).alias('year')
    )
    .group_by('year')
    .agg(
        # Total number of category purchase occasions each year
        pl.col('trans_id').n_unique().alias('num_trans'),
        # The number of panellists that made at least one transaction each year
        pl.col('panel_id').n_unique().alias('num_buyers'),
        # Total revenue generated each year
        pl.col('spend').sum(),
        # Total volume purchased each year
        pl.col('volume').sum()
    )
)

alpha_yearly = (
    alpha_yearly_summary
    .with_columns(
        # Yearly penetration = num_buyers / panel_size
        (pl.col('num_buyers') / panel_size).alias('penet'),
        # PPB = num_trans / num_buyers
        (pl.col('num_trans') / pl.col('num_buyers')).alias('ppb'),
        # Average order value = spend / num_trans
        (pl.col('spend') / pl.col('num_trans')).alias('aov'),
        # Average order volume = volume / num_trans
        (pl.col('volume') / pl.col('num_trans')).alias('aovol'),
        # Average price per unit volume = spend / volume
        (pl.col('spend') / pl.col('volume')).alias('avg_price_kg')
    )
    .with_columns(pl.col('year').cast(pl.String))
    .collect()
)

# Transpose to cross-tabulate summary for presentation
alpha_yearly_T = alpha_yearly.transpose(
    include_header=True,
    header_name='Summary',
    column_names='year')

# Compute percentage change for each variable from year 1 to year 2
alpha_yearly_pct_change = (
    alpha_yearly
    .with_columns(pl.col('*').exclude('year').pct_change())
    .transpose(include_header=True, header_name='Summary', column_names='year')
    .select('Summary', pl.col('2').alias('pct_change'))
)

# Join summary variables and percentage change cross-tabular data for presentation
alpha_yearly_T = alpha_yearly_T.join(alpha_yearly_pct_change, on='Summary')
We note that the 5% increase in revenue from year 1 to year 2 is associated with a 5% increase in the number of households making at least one purchase of Alpha in the year. While PPB drops, average order value increases, with these two changes effectively cancelling each other out. (The product of these two quantities, \((PPB_{2} \times AOV_{2})-(PPB_{1} \times AOV_{1})\), changes by one cent between the two years.) While the average price/kg drops from year 1 to year 2, this is more than compensated for by the increase in average order volume, resulting in a 3% increase in average order value between the two years (AOV = Average Order Volume × Average Price/Unit).
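Because the decomposition is multiplicative and the panel size is fixed, the revenue growth factor is exactly the product of the component growth factors. A small sketch with illustrative component changes (hypothetical values, not the panel's actual figures):

```python
# With panel size fixed, revenue = panel_size x penetration x PPB x AOV, so the
# year-over-year revenue growth factor is the product of the component growth
# factors. The component changes below are illustrative, not the panel's values.
g_penet, g_ppb, g_aov = 0.05, -0.03, 0.03   # penetration up, PPB down, AOV up

growth_factor = (1 + g_penet) * (1 + g_ppb) * (1 + g_aov)
pct_change_revenue = growth_factor - 1      # roughly g_penet + g_ppb + g_aov
```

Note that for small changes the revenue growth is approximately the *sum* of the component changes, which is why a 5% penetration gain with offsetting PPB and AOV changes yields roughly 5% revenue growth.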
6.1.2 Temporal Variation in Customer-level Purchasing - by Brand Transactions
We have observed that the number of buyers has increased, yet the average number of transactions (PPB) has dropped. Does this mean the “new” buyers are light buyers? Or are the existing buyers buying less?
In order to dig deeper, we need to examine temporal variation in customer-level purchasing. A natural starting point is to examine the joint distribution of purchasing for two consecutive periods.
We first need to process the panel data so it summarizes the number of times Alpha was purchased in years 1 and 2 by each panellist (num_trans).
Then we prepare the joint distribution of purchasing table, which summarises customer-level purchasing for all combinations of transaction levels in year 1 and year 2. Each cell is a count of the panellists that made \(x_{1}\) transactions in year 1 and \(x_{2}\) transactions in year 2.
Next, we compute the conditional distribution using the joint distribution matrix computed earlier, dividing each element of the matrix by its row total (row percentages). The conditional distribution of transaction counts is the empirical probability of making \(x_{2}\) transactions in year 2 given that the panellist made \(x_{1}\) transactions in year 1.
Given that we have computed the joint distribution of panellist count for all transaction-levels for years 1 and 2 (or simply: joint distribution of transaction counts), it is a simple exercise to compute the row percentages, giving us the conditional distributions of transaction counts.
Finally, we compute and plot the marginal distribution of transaction levels for each year. The marginal distribution is the proportion of panellists who made \(x_{t}\) transactions in year \(t\), relative to the total number of panellists.
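The relationship between the joint, conditional, and marginal distributions can be sketched with NumPy on a toy 3×3 joint table (illustrative counts only):

```python
import numpy as np

# Toy joint distribution of transaction counts (panellist counts): rows index the
# year-1 level, columns the year-2 level. Numbers are illustrative only.
joint = np.array([[30, 10, 10],
                  [10, 20, 10],
                  [ 5,  5, 20]], dtype=float)

n = joint.sum()  # total panellists

# Conditional distribution: divide each row by its row total
conditional = joint / joint.sum(axis=1, keepdims=True)

# Marginal distributions: row/column totals relative to all panellists
marginal_y1 = joint.sum(axis=1) / n
marginal_y2 = joint.sum(axis=0) / n

assert np.allclose(conditional.sum(axis=1), 1.0)  # each row is a distribution
assert np.isclose(marginal_y1.sum(), 1.0)
assert np.isclose(marginal_y2.sum(), 1.0)
```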
Code
# Summarize the number of times Alpha was purchased in years 1 and 2 by each panellist
alpha_yearly_trans = (
    grocery_lf
    .filter(pl.col('brand') == 'Alpha')
    .with_columns((pl.col('week') / 52).ceil().cast(pl.UInt16).alias('year'))
    # Aggregate to count unique 'trans_id' by 'panel_id' and 'year'
    .group_by('panel_id', 'year')
    .agg(pl.col('trans_id').n_unique().alias('num_trans'))
)

alpha_yearly_trans = (
    # Reshape from long format to wide format by pivoting on 'year'
    alpha_yearly_trans.collect().pivot(
        on='year',           # Each unique year will create a new column
        values='num_trans',  # The column to aggregate
        index='panel_id'     # Rows will be indexed by 'panel_id'
    )
    # Rename the columns to 'Year 1' and 'Year 2'
    .rename({'1': 'Year 1', '2': 'Year 2'})
    # Replace any null values with 0
    .fill_null(0)
)

# Create a basic joint distribution
joint_dist_trans = (
    alpha_yearly_trans
    .group_by('Year 1', 'Year 2')
    .agg(pl.col('panel_id').count().alias('count'))
)

# Add in the number of panellists that made no purchase of Alpha in either year.
# The dataframe so far covers only the 3142 households that bought Alpha at least
# once across the two years, while the panel contains 5021 panellists. Therefore
# the correct number of households that made zero purchases of Alpha in years 1
# and 2 should be included in the processed dataframe.
tmp = panel_size - joint_dist_trans.select((pl.col('count')).sum()).item(0, 0)
zero_purch = pl.DataFrame({'Year 1': 0, 'Year 2': 0, 'count': tmp})

# Right-censor the distribution at 10+
joint_dist_trans = (
    joint_dist_trans.with_columns(
        pl.when(pl.col('Year 1') > 9)
        .then(10)
        .otherwise(pl.col('Year 1'))
        .alias('Year 1')
    )
    .with_columns(
        pl.when(pl.col('Year 2') > 9)
        .then(10)
        .otherwise(pl.col('Year 2'))
        .alias('Year 2')
    )
    .vstack(zero_purch.cast(pl.UInt32))
)

# Pivot the right-censored data to create a year 1 x year 2 matrix of the count
# of panellists who made x_1 transactions in year 1 and x_2 in year 2
joint_dist_trans_pivot = (
    joint_dist_trans
    .sort('Year 2')
    .pivot(
        index='Year 1',
        on='Year 2',
        values='count',
        aggregate_function='sum'
    )
    .rename({'Year 1': 'Year'})
    .sort('Year')
    .fill_null(0)
)

# Compute the row percentages, giving the conditional distributions of transaction
# counts: the empirical probability of making x_2 transactions in year 2 given
# that the panellist made x_1 transactions in year 1
year1_trans_total = joint_dist_trans_pivot.select(
    pl.col('*').exclude('Year')).sum_horizontal()
pct_year1_total = joint_dist_trans_pivot.with_columns(
    (pl.col('*').exclude('Year') / year1_trans_total))
Code
# Summarize Alpha transactions by year and panellist
alpha_yearly_trans = (
    grocery_lf
    .filter(pl.col('brand') == 'Alpha')
    .with_columns((pl.col('week') / 52).ceil().cast(pl.UInt16).alias('year'))
    .group_by(['panel_id', 'year'])
    .agg(pl.col('trans_id').n_unique().alias('num_trans'))
    .collect()
    .pivot(values='num_trans', index='panel_id', on='year')
    .rename({'1': 'Year 1 Trans', '2': 'Year 2 Trans'})
    .fill_null(0)
)

# Create joint distribution
joint_dist_trans = (
    alpha_yearly_trans
    .group_by(['Year 1 Trans', 'Year 2 Trans'])
    .agg(pl.len().alias('count'))
)

# Add zero-purchase panellists
zero_purchasers = panel_size - joint_dist_trans['count'].sum()
joint_dist_trans = joint_dist_trans.vstack(
    pl.DataFrame({'Year 1 Trans': [0], 'Year 2 Trans': [0],
                  'count': [zero_purchasers]}).cast(pl.UInt32)
)

# Right-censor transactions at 10+
joint_dist_trans = joint_dist_trans.with_columns([
    pl.when(pl.col('Year 1 Trans') > 9).then(10).otherwise(pl.col('Year 1 Trans')).alias('Year 1 Trans'),
    pl.when(pl.col('Year 2 Trans') > 9).then(10).otherwise(pl.col('Year 2 Trans')).alias('Year 2 Trans')
])

# Create the year 1 x year 2 distribution matrix
joint_dist_trans_pivot = (
    joint_dist_trans
    .sort('Year 2 Trans')
    .pivot(values='count', index='Year 1 Trans', on='Year 2 Trans',
           aggregate_function='sum')
    .rename({'Year 1 Trans': '# Trans'})
    .sort('# Trans')
    .fill_null(0)
)

# Compute conditional distribution (row percentages)
row_totals = joint_dist_trans_pivot.select(pl.col('*').exclude('# Trans')).sum_horizontal()
pct_joint_dist = joint_dist_trans_pivot.with_columns(
    (pl.col('*').exclude('# Trans') / row_totals))

# Display final results
print("Alpha Yearly Transactions:")
display(alpha_yearly_trans)
print("Joint Distribution:")
display(joint_dist_trans)
print("Joint Distribution - Year 1 x Year 2 Distribution Matrix:")
display(joint_dist_trans_pivot)
print("Conditional Distribution (% of Year 1 Transactions):")
display(pct_joint_dist)
Alpha Yearly Transactions:

shape: (3_142, 3)

| panel_id | Year 1 Trans | Year 2 Trans |
|---------:|-------------:|-------------:|
| 3106102  | 7 | 7 |
| 3112963  | 2 | 3 |
| 3106069  | 0 | 2 |
| 3109622  | 0 | 2 |
| 3111357  | 4 | 3 |
| …        | … | … |
| 3102011  | 1 | 0 |
| 3113422  | 1 | 0 |
| 3102803  | 0 | 1 |
| 3103344  | 1 | 0 |
| 3106812  | 0 | 4 |
Joint Distribution:

shape: (173, 3)

| Year 1 Trans | Year 2 Trans | count |
|-------------:|-------------:|------:|
| 10 |  8 | 1 |
|  7 |  5 | 15 |
|  2 |  7 | 9 |
| 10 | 10 | 1 |
| 10 |  2 | 1 |
|  … |  … | … |
|  5 |  6 | 23 |
|  8 |  8 | 10 |
|  9 |  9 | 5 |
|  7 |  4 | 15 |
|  0 |  0 | 1879 |
Joint Distribution - Year 1 x Year 2 Distribution Matrix:

shape: (11, 12)

| # Trans | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|--------:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|---:|
| 0 | 1879 | 342 | 105 | 39 | 18 | 9 | 4 | 1 | 0 | 0 | 0 |
| 1 | 259 | 201 | 128 | 79 | 40 | 14 | 6 | 3 | 2 | 0 | 1 |
| 2 | 83 | 120 | 108 | 80 | 75 | 27 | 11 | 9 | 3 | 1 | 0 |
| 3 | 25 | 60 | 78 | 83 | 65 | 54 | 21 | 9 | 3 | 1 | 1 |
| 4 | 8 | 28 | 62 | 45 | 54 | 34 | 26 | 8 | 3 | 6 | 3 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 6 | 1 | 6 | 15 | 17 | 24 | 31 | 20 | 13 | 11 | 3 | 3 |
| 7 | 0 | 2 | 7 | 5 | 15 | 15 | 16 | 14 | 8 | 4 | 15 |
| 8 | 1 | 1 | 3 | 5 | 9 | 9 | 16 | 6 | 10 | 8 | 3 |
| 9 | 1 | 0 | 3 | 4 | 4 | 4 | 7 | 8 | 4 | 5 | 7 |
| 10 | 0 | 1 | 3 | 0 | 3 | 8 | 12 | 14 | 12 | 5 | 49 |
Conditional Distribution (% of Year 1 Transactions):

shape: (11, 12)

| # Trans | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|--------:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|---:|
| 0 | 0.783897 | 0.142678 | 0.043805 | 0.01627 | 0.007509 | 0.003755 | 0.001669 | 0.000417 | 0.0 | 0.0 | 0.0 |
| 1 | 0.353342 | 0.274216 | 0.174625 | 0.107776 | 0.05457 | 0.0191 | 0.008186 | 0.004093 | 0.002729 | 0.0 | 0.001364 |
| 2 | 0.160542 | 0.232108 | 0.208897 | 0.154739 | 0.145068 | 0.052224 | 0.021277 | 0.017408 | 0.005803 | 0.001934 | 0.0 |
| 3 | 0.0625 | 0.15 | 0.195 | 0.2075 | 0.1625 | 0.135 | 0.0525 | 0.0225 | 0.0075 | 0.0025 | 0.0025 |
| 4 | 0.028881 | 0.101083 | 0.223827 | 0.162455 | 0.194946 | 0.122744 | 0.093863 | 0.028881 | 0.01083 | 0.021661 | 0.01083 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 6 | 0.006944 | 0.041667 | 0.104167 | 0.118056 | 0.166667 | 0.215278 | 0.138889 | 0.090278 | 0.076389 | 0.020833 | 0.020833 |
| 7 | 0.0 | 0.019802 | 0.069307 | 0.049505 | 0.148515 | 0.148515 | 0.158416 | 0.138614 | 0.079208 | 0.039604 | 0.148515 |
| 8 | 0.014085 | 0.014085 | 0.042254 | 0.070423 | 0.126761 | 0.126761 | 0.225352 | 0.084507 | 0.140845 | 0.112676 | 0.042254 |
| 9 | 0.021277 | 0.0 | 0.06383 | 0.085106 | 0.085106 | 0.085106 | 0.148936 | 0.170213 | 0.085106 | 0.106383 | 0.148936 |
| 10 | 0.0 | 0.009346 | 0.028037 | 0.0 | 0.028037 | 0.074766 | 0.11215 | 0.130841 | 0.11215 | 0.046729 | 0.457944 |
Code
relabel = pl.Series([f'{i}' if i < 10 else '10+' for i in range(11)])
joint_dist_trans_pivot = joint_dist_trans_pivot.select(
    pl.col('*').exclude('# Trans')).with_columns(relabel.alias('# Trans'))

(
    GT(joint_dist_trans_pivot, rowname_col='# Trans')
    .tab_header(title='Joint Distribution of the Purchasing of Alpha in Year 1 & 2')
    .tab_spanner(label='# of Transactions in Year 2',
                 columns=joint_dist_trans_pivot.columns[:-1])
    .tab_stubhead(label='Y1 / Y2')
    .cols_label({'10': '10+'})
    .cols_align(align='center')
    .fmt_integer()
    .data_color(
        columns=[str(i) for i in range(10)],
        domain=[0, 1900],
        palette=['white', 'rebeccapurple'],
    )
    .tab_source_note(md("""**Read Row/Column as**: _# of Transactions in Year 1 / # of Transactions in Year 2_"""))
)
Joint Distribution of the Purchasing of Alpha in Year 1 & 2
(Columns: # of Transactions in Year 2)

| Y1 / Y2 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+ |
|--------:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|----:|
| 0 | 1,879 | 342 | 105 | 39 | 18 | 9 | 4 | 1 | 0 | 0 | 0 |
| 1 | 259 | 201 | 128 | 79 | 40 | 14 | 6 | 3 | 2 | 0 | 1 |
| 2 | 83 | 120 | 108 | 80 | 75 | 27 | 11 | 9 | 3 | 1 | 0 |
| 3 | 25 | 60 | 78 | 83 | 65 | 54 | 21 | 9 | 3 | 1 | 1 |
| 4 | 8 | 28 | 62 | 45 | 54 | 34 | 26 | 8 | 3 | 6 | 3 |
| 5 | 5 | 13 | 28 | 31 | 49 | 46 | 23 | 20 | 5 | 3 | 4 |
| 6 | 1 | 6 | 15 | 17 | 24 | 31 | 20 | 13 | 11 | 3 | 3 |
| 7 | 0 | 2 | 7 | 5 | 15 | 15 | 16 | 14 | 8 | 4 | 15 |
| 8 | 1 | 1 | 3 | 5 | 9 | 9 | 16 | 6 | 10 | 8 | 3 |
| 9 | 1 | 0 | 3 | 4 | 4 | 4 | 7 | 8 | 4 | 5 | 7 |
| 10+ | 0 | 1 | 3 | 0 | 3 | 8 | 12 | 14 | 12 | 5 | 49 |
Read Row/Column as: # of Transactions in Year 1 / # of Transactions in Year 2
How do we read this table? The row labelled 1 tells us how many of the people who bought Alpha once in year 1 bought Alpha 0, 1, 2, . . . times in year 2. For example, 259 of these households didn’t buy Alpha in year 2, 201 bought Alpha once in year 2, and so on.
Code
relabel = pl.Series([f'{i}' if i < 10 else '10+' for i in range(11)])
pct_joint_dist = pct_joint_dist.select(
    pl.col('*').exclude('# Trans')).with_columns(relabel.alias('# Trans'))

(
    GT(pct_joint_dist, rowname_col='# Trans')
    .tab_header(title='Conditional Distribution of Transaction Counts',
                subtitle=md('Empirical probability of making year 2 transactions given year 1 transactions'))
    .tab_spanner(label='# of Transactions in Year 2',
                 columns=pct_joint_dist.columns[:-1])
    .tab_stubhead(label='Y1 / Y2')
    .cols_label({'10': '10+'})
    .cols_align(align='center')
    .fmt_percent(decimals=0)
    .data_color(
        columns=[str(i) for i in range(11)],
        domain=[0, 1],
        palette=['white', 'rebeccapurple'],
    )
    .tab_source_note(md("""**Read Row/Column as**: _# of Transactions in Year 1 / # of Transactions in Year 2_"""))
)
Conditional Distribution of Transaction Counts
Empirical probability of making year 2 transactions given year 1 transactions
(Columns: # of Transactions in Year 2)

| Y1 / Y2 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+ |
|--------:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|----:|
| 0 | 78% | 14% | 4% | 2% | 1% | 0% | 0% | 0% | 0% | 0% | 0% |
| 1 | 35% | 27% | 17% | 11% | 5% | 2% | 1% | 0% | 0% | 0% | 0% |
| 2 | 16% | 23% | 21% | 15% | 15% | 5% | 2% | 2% | 1% | 0% | 0% |
| 3 | 6% | 15% | 20% | 21% | 16% | 14% | 5% | 2% | 1% | 0% | 0% |
| 4 | 3% | 10% | 22% | 16% | 19% | 12% | 9% | 3% | 1% | 2% | 1% |
| 5 | 2% | 6% | 12% | 14% | 22% | 20% | 10% | 9% | 2% | 1% | 2% |
| 6 | 1% | 4% | 10% | 12% | 17% | 22% | 14% | 9% | 8% | 2% | 2% |
| 7 | 0% | 2% | 7% | 5% | 15% | 15% | 16% | 14% | 8% | 4% | 15% |
| 8 | 1% | 1% | 4% | 7% | 13% | 13% | 23% | 8% | 14% | 11% | 4% |
| 9 | 2% | 0% | 6% | 9% | 9% | 9% | 15% | 17% | 9% | 11% | 15% |
| 10+ | 0% | 1% | 3% | 0% | 3% | 7% | 11% | 13% | 11% | 5% | 46% |
Read Row/Column as: # of Transactions in Year 1 / # of Transactions in Year 2
How do we interpret this table? Looking at the row labelled 1, we see that 35.3% of the panellists that made one purchase of Alpha in year 1 made no purchases of Alpha in year 2, 27.4% purchased Alpha once, 17.5% purchased Alpha twice, and so on.
Code
# Compute the marginal transaction distribution for each year: the number of
# panellists who made x_t transactions as a % of total panellists
def marginal_dist():
    years = ['Year 1 Trans', 'Year 2 Trans']
    dfs = []
    for year in years:
        y_dist = (
            joint_dist_trans
            .group_by(year)
            .agg((pl.col('count').sum() / panel_size).alias(f'% of Total - {year}'))
            .rename({year: 'num_trans'})
            .sort('num_trans')
        )
        dfs.append(y_dist)
    marginal_dist = dfs[0].hstack(dfs[1].drop('num_trans'))
    return marginal_dist

marginal_dist()
shape: (11, 3)

| num_trans | % of Total - Year 1 Trans | % of Total - Year 2 Trans |
|----------:|--------------------------:|--------------------------:|
| 0 | 0.477395 | 0.450508 |
| 1 | 0.145987 | 0.154153 |
| 2 | 0.102968 | 0.107548 |
| 3 | 0.079665 | 0.077275 |
| 4 | 0.055168 | 0.070902 |
| … | … | … |
| 6 | 0.02868 | 0.032264 |
| 7 | 0.020116 | 0.020912 |
| 8 | 0.014141 | 0.012149 |
| 9 | 0.009361 | 0.00717 |
| 10 | 0.02131 | 0.017128 |
Code
# Marginal distribution for each year and the associated clustered bar chart
# Step 1a: Compute the marginal transaction distribution for each year: the number
#          of panellists who made x_t transactions as a % of total panellists
# Step 1b: Convert wide-form data to long-form data
def marginal_dist():
    years = ['Year 1 Trans', 'Year 2 Trans']
    dfs = []
    for year in years:
        y_dist = (
            joint_dist_trans
            .group_by(year)
            # Marginal distribution of the purchasing of Alpha in the first & second year
            .agg((pl.col('count').sum() / panel_size).alias('% of Total'))
            .with_columns(pl.lit(year).alias('Year'))
            .rename({year: 'num_trans'})
        )
        dfs.append(y_dist)
    marginal_dist = dfs[0].vstack(dfs[1])
    return marginal_dist

# Step 2: Plot long-form data as grouped bar chart
alt.Chart(marginal_dist()).mark_bar().encode(
    x=alt.X("num_trans:O", axis=alt.Axis(labelAngle=0, title='Number of Transactions')),
    xOffset="Year:N",
    y=alt.Y("% of Total:Q", axis=alt.Axis(format=".0%", title='% of Households')),
    color='Year:N'
).properties(
    width=650, height=250,
    title={"text": 'Distribution of Alpha Transactions',
           "subtitle": 'Marginal Distribution'},
)
Let us make two immediate observations:

- The marginal distribution of purchasing in the first year is reasonably similar to that for the second year. The differences that stand out are the smaller percentage of households making zero purchases in year 2, which corresponds to the higher penetration, and the smaller percentage of households buying Alpha ten or more times in year 2.
- When first seeing a table that shows the distribution of year 2 purchasing broken down by the level of year 1 purchasing, many expect there to be a strong diagonal in the table and are alarmed by the fact that this is rarely the case. It is important to realise that buying behaviour is not deterministic. From the perspective of the analyst, it can be viewed as-if random, bouncing around each person’s underlying propensity to buy the product. Someone who makes one purchase in year 1 is probably a light buyer, so is it really surprising that 35.3% did not buy the product again in the second year? It does not mean that they are “lost”; most of them will buy Alpha again sometime the following year (year 3). Similarly, are those who did not buy the product in period 1 but did in period 2 new customers? Some probably are. But for any established product category, most are probably people who have purchased the product in previous years and who, for whatever reason, did not buy it that year. The key to analysing such tables is to compare them to a benchmark that assumes stable underlying purchasing patterns.
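One way to build intuition for the stable-purchasing benchmark is a small simulation (a sketch with arbitrarily chosen parameters, not a fitted benchmark model): give every household a fixed Poisson purchase rate and draw two independent years of buying from the same rates. Even with perfectly stable propensities, many year-1 once-buyers make no purchase at all in year 2.

```python
import numpy as np

# A small simulation of the "stable propensities" idea (a sketch, not a fitted
# model): every household keeps a FIXED Poisson purchase rate, and we draw two
# independent years of buying from those same rates.
rng = np.random.default_rng(42)
n_households = 5000

# Heterogeneous but time-invariant propensities (gamma parameters chosen arbitrarily)
rates = rng.gamma(shape=0.7, scale=2.0, size=n_households)

year1 = rng.poisson(rates)
year2 = rng.poisson(rates)

# Even though no household's propensity changed, a sizeable share of year-1
# once-buyers make no purchase at all in year 2
once_in_y1 = year1 == 1
lapse_share = np.mean(year2[once_in_y1] == 0)
print(f"Once-buyers in year 1 who lapse in year 2: {lapse_share:.0%}")
```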
6.1.3 Temporal Variation in Customer-level Spend - by Category Spend
Let’s explore how to create the joint distribution of category spend in years 1 and 2. The logic follows that of the binning of spend used to create the distribution of category spend in year 1 in the earlier section.
It is important to note that the initial aggregation by panel_id and year in the long-form DataFrame contains no entries for panellists who made no category purchases in a given year. We need to account both for panellists who made a category purchase in one year but not the other, and for those who made no category purchase in either year. When the data is pivoted, the first group shows up as null values in the year in which they made no purchase; we replace these nulls with float('-inf'), which stands in for the (-inf, 0] “no purchase” bin. The second group is missing from the DataFrame altogether; we count the panellists present in the DataFrame, subtract that count from panel_size, and add the difference back as a row of zero-spend panellists.
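The binning-and-backfilling logic can be sketched in plain Python with a hypothetical helper that mirrors the $5-wide bins with upper edges 0, 5, …, 50 used here (the helper name and structure are illustrative, not part of the notebook's pipeline):

```python
import bisect

# Upper edges of the $5-wide spend bins used above
breaks = list(range(0, 51, 5))

def spend_bin(spend):
    """Upper breakpoint of the (lo, hi] bin containing `spend`; float('inf') for
    spend above 50, and float('-inf') for a panellist with no category purchase.
    (Hypothetical helper mirroring the pl.cut + fill_null logic.)"""
    if spend is None:  # no category purchase that year
        return float('-inf')
    i = bisect.bisect_left(breaks, spend)
    return float(breaks[i]) if i < len(breaks) else float('inf')

assert spend_bin(12.30) == 15.0          # falls in the (10, 15] bin
assert spend_bin(None) == float('-inf')  # no purchase -> the "no purchase" bin
assert spend_bin(62.0) == float('inf')   # above the last break
```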
Code
# Summarize category spend by year and panellist
joint_dist_spend = (
    grocery_lf
    .with_columns(
        (pl.col('week') / 52).ceil().cast(pl.UInt16).alias('year'),
        (pl.col('units') * pl.col('price')).cast(pl.Float32).alias('spend')
    )
    .group_by(['panel_id', 'year'])
    .agg(pl.col('spend').sum())
    .with_columns(pl.col('spend').cut(breaks=np.arange(0, 51, 5), include_breaks=True).alias('cut'))
    .with_columns(pl.col('cut')).unnest('cut')
    .collect()
    .pivot(values='breakpoint', index='panel_id', on='year')
    .fill_null(float('-inf'))
    # Create joint distribution
    .rename({'1': 'Year 1', '2': 'Year 2'})
    .group_by(['Year 1', 'Year 2'])
    .agg(pl.len().alias('count'))
)

# Add zero-spend panellists. The joint distribution of spend so far is for those
# panellists that made at least one category purchase across the two years; we
# also want to include those panellists that didn't make a category purchase.
zero_purchasers = pl.Series([panel_size - joint_dist_spend['count'].sum()]).cast(pl.UInt32)
joint_dist_spend = joint_dist_spend.vstack(
    pl.DataFrame({'Year 1': [float('-inf')], 'Year 2': [float('-inf')],
                  'count': zero_purchasers})
)

# Create the year 1 x year 2 distribution matrix
joint_dist_spend_pivot = (
    joint_dist_spend
    .sort('Year 2')
    .pivot(values='count', index='Year 1', on='Year 2', aggregate_function='sum')
    .rename({'Year 1': 'Spend'})
    .sort('Spend')
    .fill_null(0)
)

print('Joint Distribution:')
display(joint_dist_spend)
print('Joint Distribution - Year 1 x Year 2 Distribution Matrix:')
display(joint_dist_spend_pivot)
Joint Distribution:

shape: (126, 3)

| Year 1 | Year 2 | count |
|-------:|-------:|------:|
| 20.0 | 25.0 | 86 |
| 5.0 | 5.0 | 260 |
| 45.0 | 15.0 | 1 |
| 50.0 | 20.0 | 2 |
| 20.0 | 5.0 | 35 |
| … | … | … |
| 25.0 | 50.0 | 5 |
| 30.0 | 50.0 | 2 |
| 10.0 | 15.0 | 196 |
| inf | -inf | 1 |
| -inf | -inf | 183 |
Joint Distribution - Year 1 x Year 2 Distribution Matrix:

shape: (12, 13)

| Spend | -inf | 5.0 | 10.0 | 15.0 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 | 45.0 | 50.0 | inf |
|------:|-----:|----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|----:|
| -inf | 183 | 125 | 88 | 38 | 9 | 2 | 1 | 0 | 1 | 0 | 0 | 0 |
| 5.0 | 129 | 260 | 211 | 103 | 30 | 5 | 9 | 1 | 1 | 0 | 0 | 1 |
| 10.0 | 87 | 217 | 379 | 196 | 113 | 52 | 17 | 7 | 2 | 0 | 0 | 3 |
| 15.0 | 28 | 107 | 248 | 223 | 140 | 78 | 24 | 15 | 6 | 4 | 7 | 2 |
| 20.0 | 12 | 35 | 102 | 164 | 138 | 86 | 43 | 19 | 13 | 5 | 2 | 4 |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| 35.0 | 0 | 3 | 10 | 18 | 28 | 38 | 17 | 32 | 23 | 11 | 9 | 4 |
| 40.0 | 0 | 0 | 3 | 7 | 22 | 13 | 14 | 17 | 19 | 9 | 8 | 10 |
| 45.0 | 0 | 0 | 0 | 1 | 7 | 8 | 11 | 10 | 6 | 14 | 6 | 5 |
| 50.0 | 0 | 1 | 0 | 2 | 2 | 7 | 4 | 6 | 3 | 6 | 5 | 12 |
| inf | 1 | 1 | 0 | 3 | 4 | 5 | 8 | 13 | 9 | 7 | 11 | 47 |
Code
relabel = pl.Series(['-∞ - 0' if lb < 0 else f'{lb} - {lb + 5}' if lb != 50 else f'{lb} - ∞'
                     for lb in range(-5, 55, 5)])
joint_dist_spend_pivot = joint_dist_spend_pivot.select(
    pl.col('*').exclude('Spend')).with_columns(relabel.alias('Spend'))

(
    GT(joint_dist_spend_pivot, rowname_col='Spend')
    .tab_header(title='Joint Distribution of Category Spend in Year 1 & 2')
    .tab_spanner(label='Range of Spend in Year 2',
                 columns=joint_dist_spend_pivot.columns[:-1])
    .tab_stubhead(label='Year')
    .cols_label({joint_dist_spend_pivot.columns[:-1][i]: relabel[i] for i in range(len(relabel))})
    .cols_align(align='center')
    .fmt_integer()
    .data_color(
        columns=joint_dist_spend_pivot.columns[:-1],
        domain=[0, 400],
        palette=['white', 'rebeccapurple'],
    )
    .tab_source_note(md("""**Read Row/Column as**: _Range of Spend in Year 1 / Range of Spend in Year 2_"""))
)
Joint Distribution of Category Spend in Year 1 & 2
(Columns: Range of Spend in Year 2)

| Y1 / Y2 | -∞ - 0 | 0 - 5 | 5 - 10 | 10 - 15 | 15 - 20 | 20 - 25 | 25 - 30 | 30 - 35 | 35 - 40 | 40 - 45 | 45 - 50 | 50 - ∞ |
|--------:|-------:|------:|-------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|-------:|
| -∞ - 0 | 183 | 125 | 88 | 38 | 9 | 2 | 1 | 0 | 1 | 0 | 0 | 0 |
| 0 - 5 | 129 | 260 | 211 | 103 | 30 | 5 | 9 | 1 | 1 | 0 | 0 | 1 |
| 5 - 10 | 87 | 217 | 379 | 196 | 113 | 52 | 17 | 7 | 2 | 0 | 0 | 3 |
| 10 - 15 | 28 | 107 | 248 | 223 | 140 | 78 | 24 | 15 | 6 | 4 | 7 | 2 |
| 15 - 20 | 12 | 35 | 102 | 164 | 138 | 86 | 43 | 19 | 13 | 5 | 2 | 4 |
| 20 - 25 | 4 | 19 | 52 | 82 | 90 | 81 | 63 | 25 | 17 | 9 | 5 | 3 |
| 25 - 30 | 0 | 4 | 21 | 29 | 54 | 47 | 45 | 22 | 21 | 6 | 2 | 5 |
| 30 - 35 | 0 | 3 | 10 | 18 | 28 | 38 | 17 | 32 | 23 | 11 | 9 | 4 |
| 35 - 40 | 0 | 0 | 3 | 7 | 22 | 13 | 14 | 17 | 19 | 9 | 8 | 10 |
| 40 - 45 | 0 | 0 | 0 | 1 | 7 | 8 | 11 | 10 | 6 | 14 | 6 | 5 |
| 45 - 50 | 0 | 1 | 0 | 2 | 2 | 7 | 4 | 6 | 3 | 6 | 5 | 12 |
| 50 - ∞ | 1 | 1 | 0 | 3 | 4 | 5 | 8 | 13 | 9 | 7 | 11 | 47 |
Read Row/Column as: Range of Spend in Year 1 / Range of Spend in Year 2
6.1.4 Repeat Rates
One common summary measure is the repeat rate (or repeat-buying rate), which is defined as the percentage of the brand’s customers in a given period who also purchase the product in the following period. We now explore how to compute quarter-by-quarter repeat rates.
We will carry out the analysis and plot the results for Alpha.
The first thing we do is create a quarterly incidence matrix: a table that indicates whether or not each panellist made at least one purchase of Alpha in each of the eight quarters in our dataset (i.e., an ever-buyers-by-quarter table for Alpha).
We create a pivot table where rows represent panel_id and columns represent quarter.
Each cell indicates whether a panellist purchased in that quarter (1 for yes, 0 for no).
Recall that the repeat rate is the percentage of a brand’s customers in a given period who also purchase the product in the following period. We compute the repeat-buying rate (RBR) for the quarter pair $q \to q+1$ as:

\[
\text{RBR}(q \to q+1) = \frac{\sum_{i=1}^{N} Q_{q,i} \cdot Q_{q+1,i}}{\sum_{i=1}^{N} Q_{q,i}}
\]
- where $Q_{q,i}$ is the value of quarter $q$ for panellist $i$: $1$ if the panellist purchased in quarter $q$, and $0$ otherwise.
- $N$ is the total number of panellists.
- $Q_{q,i} \cdot Q_{q+1,i}$ is an element-wise product for panellist $i$: it is $1$ if the panellist purchased in both quarters $q$ and $q+1$, and $0$ otherwise.
- $\sum_{i=1}^{N} Q_{q,i} \cdot Q_{q+1,i}$ counts the panellists who purchased at least once in **both** quarter $q$ and quarter $q+1$.
- $\sum_{i=1}^{N} Q_{q,i}$ counts the total number of panellists (households) who purchased in quarter $q$.
- $\text{RBR}(q \to q+1)$ therefore measures the proportion of buyers in quarter $q$ who purchased again in quarter $q+1$.
- The element-wise operation ensures the calculation is based on individual panellists' purchase patterns: because $Q$ is binary, the product $Q_{q,i} \cdot Q_{q+1,i}$ is $1$ only when both values are $1$. The formula aggregates over all panellists ($i$) to compute both the numerator and denominator.
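As a cross-check of the formula, here is a minimal NumPy version operating on a tiny hypothetical incidence matrix (toy data; rows are panellists, columns are quarters):

```python
import numpy as np

# Binary incidence matrix: 1 if the panellist bought in that quarter (toy data)
Q = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 1]])

def repeat_rate(Q, q):
    """RBR(q -> q+1): share of quarter-q buyers who also bought in quarter q+1.
    Here q is a 0-based column index."""
    both = (Q[:, q] * Q[:, q + 1]).sum()  # bought in quarters q AND q+1
    return both / Q[:, q].sum()           # divided by the # of quarter-q buyers

rbr_values = [repeat_rate(Q, q) for q in range(Q.shape[1] - 1)]
```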
Code
alpha_qtrly_incidence = (
    grocery_lf
    .filter(pl.col('brand') == 'Alpha')
    .with_columns((np.floor((pl.col('week') - 1) / 13) + 1).cast(pl.UInt16).alias('quarter'))
    .group_by('panel_id', 'quarter')
    .agg(pl.lit(1).alias('incidence'))  # Mark 1 for incidence
    .collect()
    .pivot(
        values='incidence',
        index='panel_id',
        on='quarter',
        aggregate_function='max'  # Ensures presence is represented as 1
    )
    .fill_null(0)  # Replace missing quarters with 0
)

# Calculate Repeat Buying Rate (RBR)
rbr = []
num_quarters = alpha_qtrly_incidence.shape[1] - 1  # Exclude 'panel_id' column

for q in range(1, num_quarters):  # Iterate through quarter pairs dynamically
    prev_col = str(q)
    next_col = str(q + 1)
    # Compute RBR for each quarter pair
    rbr_value = (
        (alpha_qtrly_incidence[prev_col] * alpha_qtrly_incidence[next_col]).sum()
        / alpha_qtrly_incidence[prev_col].sum()
    )
    rbr.append(rbr_value)

# Convert RBR values to a DataFrame for clarity
rbr = pl.DataFrame({"Quarter Pair": [f"{q}->{q+1}" for q in range(1, num_quarters)],
                    "RBR": rbr})

display(alpha_qtrly_incidence)
display(rbr)
shape: (3_142, 9)

panel_id  5    1    8    7    4    3    6    2
u32       i32  i32  i32  i32  i32  i32  i32  i32
3110931   1    1    1    0    0    1    1    1
3103609   1    1    0    1    0    1    1    0
3114935   0    0    1    0    0    0    1    0
3121926   1    1    0    1    1    1    1    1
3107362   1    1    1    1    1    1    0    1
…         …    …    …    …    …    …    …    …
3110122   1    0    0    0    0    0    0    0
3114044   0    0    0    1    0    0    0    0
3103344   0    1    0    0    0    0    0    0
3102156   0    1    0    0    0    0    0    0
3113763   0    0    0    0    0    0    1    0
shape: (7, 2)

Quarter Pair  RBR
str           f64
"1->2"        0.656682
"2->3"        0.614922
"3->4"        0.549313
"4->5"        0.593472
"5->6"        0.691401
"6->7"        0.597488
"7->8"        0.525498
Code
# Prepare the incidence matrix
alpha_qtrly_incidence = (
    grocery_lf
    .filter(pl.col('brand') == 'Alpha')
    .with_columns((np.floor((pl.col('week') - 1) / 13) + 1).cast(pl.UInt16).alias('quarter'))
    .group_by('panel_id', 'quarter')
    .agg(pl.lit(1).alias('incidence'))  # Mark 1 for incidence
    .collect()
    .pivot(
        values='incidence',
        index='panel_id',
        on='quarter',
        aggregate_function='max'  # Ensures presence is represented as 1
    )
    .fill_null(0)  # Replace missing quarters with 0
)

# Get all quarters dynamically
quarters = alpha_qtrly_incidence.columns[1:]  # Exclude 'panel_id'

# Compute RBR for all quarter pairs using vectorized operations
rbr = (
    pl.DataFrame({
        f"Q{q} -> Q{q+1}": [
            (alpha_qtrly_incidence[str(q)] * alpha_qtrly_incidence[str(q + 1)]).sum()
            / alpha_qtrly_incidence[str(q)].sum()
        ]  # Wrap in a list for DataFrame creation
        for q in range(1, len(quarters))
    })
    .transpose(include_header=True, header_name='Quarter Pair', column_names=['RBR'])  # Transpose for desired layout & rename resulting columns
)
Code
alt.Chart(rbr).mark_line(strokeWidth=1).encode(
    x=alt.X('Quarter Pair:O', title='Quarter', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('RBR:Q', title='Repeat Rate', axis=alt.Axis(format=",.0%"), scale=alt.Scale(domain=[0, 0.8]))
).properties(
    width=650, height=250,
    title='Plot of quarterly repeat rates for Alpha'
).configure_view(stroke=None)
We see that there is some variability in the repeat rate. Is it possibly in decline? One thing to realise is that this variability can be driven by the firm’s promotional activities. Aggressive promotions may attract a segment of consumers that only buy on promotion. If the firm promotes heavily in one quarter and has fewer promotions the following quarter, we could expect to see a drop in the repeat rate. We can explore this by looking at how Alpha’s promotional activity (as reflected in price/kg) varied across quarters.
exceptions = ['kiwi_lf', 'In', 'exceptions', 'active_variables']
active_variables = [
    var for var, value in globals().items()
    if not var.startswith('_')  # Exclude variables that start with "_"
    and var not in exceptions  # Exclude variables in the exceptions list
    and isinstance(value, (pl.LazyFrame, pl.DataFrame, pl.Series, alt.Chart, alt.LayerChart,
                           list, int, float, str, np.ndarray, np.int64, np.float32))  # Remove these types only
]
for var in active_variables:
    del globals()[var]
del active_variables, exceptions, var
gc.collect()
6.2 New Products
Refer to Depth-of-Repeat Sales Summary for more details
Central to diagnosing the performance of a new product is the decomposition of its total sales into trial and repeat sales. A given overall aggregate sales history could be the realization of very different purchasing scenarios. For example, a low sales level for a new product could be the result of (i) many consumers making a trial (i.e., first-ever) purchase but few of them making a repeat purchase (because the product does not meet their expectations), or (ii) a low trial rate but a high level of repeat purchasing amongst the triers (because the product meets a real need among a relatively small set of buyers). Without a trial/repeat sales decomposition, it is impossible to determine which of these scenarios best describes the sales data. (And of course it is impossible to make such decompositions without access to household-level purchasing data (i.e., consumer panel data).)
We conduct a trial/repeat sales decomposition below. The resulting plot shows the week-by-week sales of Kiwi Bubbles (dataset 2) broken down into trial and repeat sales (i.e., first-ever purchases of the new product by a panellist vs. subsequent purchases of the new product by a panellist).
Another variant of the sales decomposition is conducted in the section decomposing cumulative sales into its trial, first repeat, and additional repeat components, in which we plot cumulative sales and separate repeat sales into first repeat (second-ever) purchases and additional repeat purchases (second repeat (third-ever) purchases + third repeat (fourth-ever) purchases + . . . ). From this plot, we see that 45% of Kiwi Bubbles’ year 1 sales came from trial purchases and 38% were due to additional repeat purchases.
Given the data summary from which these two plots are created, we can create further plots that give us insight into buyer behaviour. For example, we can plot cumulative trial (growth in cumulative trial over time), which shows the percentage of households in the market that have made a trial purchase by any week in the new product’s first year on the market. (We see that just under 10% of households have tried the new product by year end. This still appears to be growing but at a far slower rate than earlier on in the year.)
People can be induced to try a new product by promotions, etc. but the key to success is repeat purchasing. The first step is making a first-repeat purchase. One useful metric to track is “percent triers repeating”—the percentage of panellists that made a trial purchase that have gone on to make a (first) repeat purchase. The associated plot for Kiwi Bubbles is constructed in the section Evolution of the percentage of triers that have made a repeat purchase.
6.2.1 Data Preparation for Basic Analyses of New Product Performance
Our initial goal is to create a summary of new product purchasing from which decompositions can be computed and associated plots can be created.
We will focus on purchasing by those panellists in Market 2. Here are the steps:

1) Set num_panellists for Market 2 to 1,499.
2) Filter the dataset on the Market column, keeping records where the market equals 2.
3) Add a day of year (DoY) column, computed as: (Week of the Year − 1) × 7 + Day of the Week = Day of Year.
4) Create a Depth of Repeat (DoR) level associated with each transaction, where 0 = Trial Purchase, 1 = First Repeat Purchase, 2 = Second Repeat Purchase, and so on. This is done using Polars’ cum_count() and over() expressions. Note: this operation requires the records to be sorted chronologically for each panellist.
   - cum_count() computes a running count of rows within each group; since it numbers rows from 1, we subtract 1 so the count starts at 0 for the first appearance of the ID.
   - The over() expression is a window function. It performs calculations across a subset of rows, grouped by one or more columns, without collapsing or aggregating the data at the group level.
   - over("ID") groups rows by ID and calculates the cumulative count only within each group.
   - Results are returned in the same row layout, preserving the original structure.
5) Once we start to process the data further to summarise how many trial, first repeat, etc. purchases occurred in each week, we want all 52 weeks of the year to be available in the summary. However, it turns out that this panel of 1,499 households only purchased the test product in 49 weeks; no purchases occurred in weeks 25, 39, and 41. We need to append the missing weeks to the dataframe constructed so far, with null values for the rest of the columns.
6) We create pivoted "Transactions by DoR by Week" and "Total Unit Sales by DoR by Week" copies of the processed dataframe.
Code
# Number of panellists in Market 2
num_panellists = 1499

kiwi_df = (
    kiwi_lf
    .filter(pl.col('Market') == 2)
    .drop('Market')
    # DoY -> day of year variable, where 1 corresponds to the day the new product was launched.
    .with_columns(((pl.col('Week') - 1) * 7 + pl.col('Day')).alias('DoY'))
    .sort(by=['ID', 'DoY'])
    .with_columns(
        (pl.col("ID").cum_count().over("ID") - 1).alias("DoR")
    )
    .collect()
)
kiwi_df
shape: (306, 6)

ID     Week  Day  Units  DoY  DoR
u16    i16   i16  i16    i16  u32
20001  49    1    1      337  0
20002  14    7    1      98   0
20003  24    6    1      167  0
20004  49    2    1      338  0
20005  6     7    1      42   0
…      …     …    …      …    …
20136  26    6    1      181  5
20136  37    7    1      259  6
20137  13    6    1      90   0
20138  18    7    1      126  0
20139  49    3    1      339  0
Code
# The smallest unit of time for our analyses is the day. Ensure we do not have
# any panellist with more than one transaction on any day.
tmp_kiwi = (
    kiwi_df
    .group_by('ID', 'DoY')  # Collapse any rows that share the same ID and DoY
    .agg(pl.col('Units').sum())
)
print("1) There aren't multiple transaction records (rows) by the same panellist on the same day:")
print(len(kiwi_df) == len(tmp_kiwi))  # True => no panellist has two records on the same day

# Did at least one panellist make a purchase of this new product each week?
print("2) Number of weeks where at least one unit of the new product was purchased/sold:")
print(len(kiwi_df['Week'].unique()))  # Only 49 of the 52 weeks saw at least one unit sold

# Which weeks are missing?
print("3) Weeks with no transactions:")
missing_weeks = list(set(np.arange(1, 53, dtype='int16')) - set(kiwi_df['Week'].unique()))
print(missing_weeks)
1) There aren't multiple transaction records (rows) by the same panellist on the same day:
True
2) Number of weeks where at least one unit of the new product was purchased/sold:
49
3) Weeks with no transactions:
[np.int16(25), np.int16(41), np.int16(39)]
Code
missing_df = pl.DataFrame(
    {col: [None] * len(missing_weeks) for col in kiwi_df.columns}
).with_columns(
    pl.Series('Week', values=missing_weeks)
)
kiwi_df = kiwi_df.vstack(missing_df)
kiwi_df
shape: (309, 6)

ID     Week  Day   Units  DoY   DoR
u16    i16   i16   i16    i16   u32
20001  49    1     1      337   0
20002  14    7     1      98    0
20003  24    6     1      167   0
20004  49    2     1      338   0
20005  6     7     1      42    0
…      …     …     …      …     …
20138  18    7     1      126   0
20139  49    3     1      339   0
null   25    null  null   null  null
null   41    null  null   null  null
null   39    null  null   null  null
Code
trans_dor_week = (
    kiwi_df
    .group_by(['Week', 'DoR'])
    .agg(pl.len().cast(pl.UInt8).alias('Count'))  # Count transactions for each (Week, DoR) pair
    .sort('DoR')
    .pivot(index='Week', on='DoR', values='Count')  # Pivot the data
    .fill_null(0)  # Fill any missing values with 0
    .sort('Week')  # Ensure the rows are sorted by 'Week'
    .drop('null')
    .with_columns(pl.sum_horizontal(pl.col('*').exclude('Week')).alias('Total'))  # Calculate total count per Week
)
trans_dor_week
shape: (52, 14)

Week  0   1   2   3   4   5   6   7   8   9   10  11  Total
i16   u8  u8  u8  u8  u8  u8  u8  u8  u8  u8  u8  u8  u8
1     8   1   0   0   0   0   0   0   0   0   0   0   9
2     6   0   0   0   0   0   0   0   0   0   0   0   6
3     2   1   0   0   0   0   0   0   0   0   0   0   3
4     16  2   0   0   0   0   0   0   0   0   0   0   18
5     8   3   0   0   0   0   0   0   0   0   0   0   11
…     …   …   …   …   …   …   …   …   …   …   …   …   …
48    1   1   1   1   0   0   0   1   0   0   0   0   5
49    4   0   0   0   0   2   0   1   1   0   0   0   8
50    0   2   0   0   0   0   0   1   2   1   1   1   8
51    0   1   0   0   0   0   0   1   0   0   0   0   2
52    2   1   1   1   0   0   0   0   0   1   0   0   6
Code
qty_dor_week = (
    kiwi_df
    .group_by(['Week', 'DoR'])
    .agg(pl.col('Units').sum().alias('Total Units'))  # Sum units for each (Week, DoR) pair
    .sort('DoR')
    .pivot(index='Week', on='DoR', values='Total Units', aggregate_function='sum')  # Pivot the data
    .fill_null(0)  # Fill any missing values with 0
    .sort('Week')  # Ensure the rows are sorted by 'Week'
    .drop('null')
    .with_columns(pl.sum_horizontal(pl.col('*').exclude('Week')).alias('Total'))  # Calculate total units per Week
)
qty_dor_week
shape: (52, 14)

Week  0    1    2    3    4    5    6    7    8    9    10   11   Total
i16   i64  i64  i64  i64  i64  i64  i64  i64  i64  i64  i64  i64  i64
1     9    1    0    0    0    0    0    0    0    0    0    0    10
2     6    0    0    0    0    0    0    0    0    0    0    0    6
3     2    1    0    0    0    0    0    0    0    0    0    0    3
4     19   3    0    0    0    0    0    0    0    0    0    0    22
5     8    3    0    0    0    0    0    0    0    0    0    0    11
…     …    …    …    …    …    …    …    …    …    …    …    …    …
48    1    1    1    1    0    0    0    1    0    0    0    0    5
49    4    0    0    0    0    2    0    2    1    0    0    0    9
50    0    2    0    0    0    0    0    1    3    2    2    1    11
51    0    2    0    0    0    0    0    2    0    0    0    0    4
52    2    1    1    2    0    0    0    0    0    2    0    0    8
6.2.2 Basic Analyses of New Product Performance
6.2.2.1 Trial/Repeat Decomposition of Sales
Now we will plot the trial/repeat decomposition of sales, which breaks weekly sales into its trial and repeat components, in the following manner:
We now plot the cumulative growth in the percentage of households (% of panel size) that have made a trial purchase (sometimes called cumulative penetration), in the following manner:
6.2.2.4 Evolution of the percentage of triers that have made a repeat purchase
We now plot the evolution of the percentage of panellists that made a trial purchase that have gone on to make a (first) repeat purchase, in the following manner:
Looking at the percentage of triers repeat purchasing plot from earlier, we see that the appeal of the new product is such that just over 37% of those customers that made a trial purchase in the first year it was on the market ended up making a repeat purchase of the new product in that year. Note that these % triers repeating numbers plotted are in calendar time (or, more precisely, time since the launch of the product). We can compute a related measure that tells us how many weeks after their trial purchase a panellist makes their (first) repeat purchase.
First Repeat Curve
The goal of the code is to compute and visualize the cumulative percentage of triers (consumers) who made their first repeat purchase a certain number of weeks after their trial purchase. The analysis includes: 1. Identifying trial weeks (Trial Week) and the time difference (FR Delta) between the trial purchase and the first repeat purchase. 2. Aggregating data to determine the counts of first repeats per trial week and calculating cumulative first repeat as a percentage of total trial week class buyers. 3. Visualizing these cumulative percentages for triers making their first repeat purchase.
As a first step, we create a summary that tells us the cumulative percentage of triers that have made a (first) repeat purchase so many weeks after their trial purchase, broken down by time of trial.
Step 1: Trial Week and First Repeat Delta (FR Delta) Computation
We start by creating a “week of trial purchase” variable (Trial Week) and a variable that counts the number of weeks between a panellist’s trial and first repeat purchase (FR Delta).
- The Trial Week variable equals the value of Week if this is a trial purchase and None otherwise. That is, the week of the trial purchase is recorded if it is the panellist’s first purchase (DoR = 0); otherwise, it is set to None.

The FR Delta variable tells us how many weeks after the trial purchase the panellist’s first repeat purchase occurred (assuming it was observed). It is computed using the shift method as follows:
We shift the DoR and Week columns by 1 row to get access to the next row’s values.
Then, use a pl.when condition to compute the FR Delta based on whether the next row’s DoR is 1.
If the DoR in the next row is 1, compute the difference between the next row’s Week and the current row’s Trial Week. Otherwise, assign None.
None indicates that a first repeat purchase was not observed for this panellist in the 52-week observation period. Strictly speaking, FR delta = 0 if the first-repeat purchase occurs in the same calendar week as the trial purchase, 1 if the first-repeat purchase occurs in the calendar week immediately after that in which the trial purchase occurred, etc.
Step 2: Aggregate Counts of First Repeats & Total Triers
Next, we aggregate the number of IDs (Count) for each combination of Trial Week and FR Delta. Rows with Trial Week and FR Delta both being None are excluded.
Additionally, we aggregate the total number of triers in each Trial Week class (the sum of ID counts for each Trial Week class). This aggregation includes both triers that made a repeat purchase and triers that did not. As expected, not all triers made a repeat purchase. - For example, we see that eight panellists made a trial purchase during the first week Kiwi Bubbles was on the market. One panellist purchased the product again that same week. Another made their first repeat purchase two weeks after their trial. Summing the count of week-1 triers who made their first repeat purchase within the 52-week observation period, we see that five of the eight week-1 triers (62.5%) had made a first repeat purchase.
Step 3: Include Missing Combinations & Perform Cumulative Operations
We cannot assume that all the trial and “time from trial to FR” weeks are observed in the dataset, so we need to fill in the missing values.
Meshgrid of Trial Week and FR Delta:
Create a dummy dataset of all possible combinations of Trial Week (1–52) and FR Delta (0–51). This ensures all potential weeks are included, even if they are absent in the observed data.
There are 52 weeks in a year, so if a trier purchases in Week 1, the latest they can make their first repeat purchase and still be included in our analysis is Week 52, i.e., 51 weeks after the trial purchase (FR Delta = 51). Additionally, given that a trier can make a first repeat purchase in the same week as their trial purchase, FR Delta can be 0. Thus the range for FR Delta is 0 to 51.
As we have 52 weeks of data, the only time someone who made a trial purchase in week 52 could make a repeat purchase is in week 52. (One week after trial would be week 53, which we do not observe.)
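The meshgrid trick itself is just a way to enumerate every (Trial Week, FR Delta) pair. A small-scale sketch with a hypothetical 4-week window:

```python
import numpy as np

# All (Trial Week, FR Delta) combinations for a toy 4-week observation window:
# trial weeks 1-4 crossed with deltas 0-3.
trial_week, fr_delta = np.meshgrid(np.arange(1, 5), np.arange(0, 4))
pairs = list(zip(trial_week.reshape(-1), fr_delta.reshape(-1)))
print(len(pairs))  # 4 weeks x 4 deltas = 16 combinations
```

In the notebook the same construction runs over `np.arange(1, 53)` and `np.arange(0, 52)`, producing all 52 × 52 combinations before the left joins fill in the observed counts.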
Join with Aggregated Data:
Combine the dummy dataset with:
agg_trial_fr: Aggregated counts of first repeat purchases grouped by Trial Week and FR Delta.
trial_week_total: Total triers in each Trial Week.
Cumulative Percentage of Triers that made a First Repeat Purchase:
We must compute the cumulative percentage of triers that have made a first repeat purchase so many weeks after their trial purchase.
First, we calculate cumulative counts of first repeat purchases (Cum FR by Week) within each Trial Week class.
Second, we compute the cumulative percentages of first repeats (Cum FR by Trial):
\[
\text{Cum FR by Trial} =
\begin{cases}
\frac{\text{Cum FR by Week}}{\text{Total Triers}}, & \text{if Total Triers} > 0 \\
0, & \text{otherwise}.
\end{cases}
\]
Handling Missing Values:
If the number of remaining weeks after the trial is less than the FR Delta (e.g., Week 52 triers cannot have an FR Delta greater than 0), assign None to such rows.
Compute Weights:
The cumulative percentage values (Cum FR by Trial) are weighted by the number of triers (Total Triers) to compute a weighted average for visualization.
This ensures that larger trial cohorts contribute proportionally more to the overall average, avoiding bias from small sample sizes.
These weights are later used to calculate the weighted cumulative percentage of first repeat purchases across all trial weeks.
Step 4: Filter and Compute Time to First Repeat Curve
This step takes the cumulative dataset (cum_fr_by_trial), filters it to focus on valid ranges for visualization, and computes the weighted average of cumulative first repeat percentages across trial weeks.
Filter Data to Valid Range
Focus on the first 26 weeks of trial (Trial Week <= 26) and corresponding first repeat purchases up to 26 weeks after the trial (FR Delta <= 26).
We restrict the analysis to a meaningful range to ensure only cohorts with sufficient observation time are included in the visualization.
For instance, triers in Trial Week 26 have exactly 26 weeks to make their first repeat purchase. Including them for longer periods would be misleading since no data is observed beyond 26 weeks.
While we could create a weighted average of all 52 rows, it would be misleading in that we would be mixing groups of triers with different numbers of weeks in which they could have made a repeat purchase. (A week-51 trier has only two weeks in which to make a first repeat purchase: weeks 51 and 52.) Given a 52-week observation period and the desire to compute the percentage of triers making a first repeat purchase 0, 1, . . . , x weeks after their trial purchase, we can only consider those triers that made a trial purchase in the first 52 − x weeks the product was on the market. For this illustrative example, we will create a plot of the cumulative percentage of triers that have made a first repeat purchase within 26 weeks of their trial purchase.
Aggregate Across Trial Weeks
The filtered data is grouped by FR Delta (weeks after trial purchase) to aggregate:
Total Triers: Total number of trial purchases within the valid range.
Weights: Summed weights for cumulative first repeat percentages across all Trial Week cohorts.
Using the aggregated Weights and Total Triers, compute the weighted average of cumulative first repeat percentages:
\[
\text{Time to FR} = \frac{\text{Weights (Summed)}}{\text{Total Triers (Summed)}}
\]
This formula ensures that the percentage of first repeats for each FR Delta is weighted by the size of the trial cohort (Total Triers) contributing to that value.
Larger cohorts (e.g., triers in Week 4) have a proportional impact on the overall trend, avoiding distortion caused by smaller cohorts (e.g., triers in Week 13 or 22).
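A tiny numeric illustration of the cohort-weighted average (the cohort sizes and percentages below are made up): because Weights = Cum FR by Trial × Total Triers, summing weights and dividing by total triers recovers the trier-weighted mean percentage.

```python
# Two hypothetical trial cohorts at some fixed FR Delta.
total_triers = [80, 20]       # cohort sizes
cum_fr_pct   = [0.25, 0.50]   # cumulative % of each cohort that has repeated

# Weights = percentage * cohort size (i.e., the cumulative repeat counts).
weights = [n * p for n, p in zip(total_triers, cum_fr_pct)]

# Weighted average across cohorts: (80*0.25 + 20*0.50) / (80 + 20) = 0.30
weighted_avg = sum(weights) / sum(total_triers)
print(weighted_avg)
```

Note the unweighted mean of the two percentages would be 0.375; weighting pulls the answer toward the larger cohort.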
Code
kiwi_trial_fr = (
    kiwi_lf
    .filter(pl.col('Market') == 2)
    .drop('Market')
    # DoY -> day of year variable, where 1 corresponds to the day the new product was launched.
    .with_columns(((pl.col('Week') - 1) * 7 + pl.col('Day')).alias('DoY'))
    .sort(by=['ID', 'DoY'])  # Sorting is necessary to organize the data so that DoR can be computed
    .with_columns((pl.col("ID").cum_count().over("ID") - 1).alias("DoR"))
    .with_columns(
        # Trial Week = Week when DoR == 0, else None
        pl.when(pl.col('DoR') == 0).then(pl.col('Week')).otherwise(None).alias('Trial Week'),
        # Shift DoR and Week to get the next row's values
        pl.col('DoR').shift(-1).alias('Next_DoR'),
        pl.col('Week').shift(-1).alias('Next_Week')
    )
    .with_columns(
        # Compute FR Delta based on the next row's DoR and Week
        pl.when(pl.col('Next_DoR') == 1)
        .then(pl.col('Next_Week') - pl.col('Trial Week'))
        .otherwise(None)
        .alias('FR Delta')
    )
    .select('ID', 'Trial Week', 'FR Delta')  # Drop unused & temporary columns
)

# Count of panellists for each combination of Trial Week and FR Delta
agg_trial_fr = (
    kiwi_trial_fr
    .group_by(['Trial Week', 'FR Delta'])
    .agg(pl.len().alias('Count'))
    .filter(
        ~pl.all_horizontal(pl.col('*').exclude('Count').is_null())  # Drop rows where both 'Trial Week' and 'FR Delta' are null
    )
)

# Sum of triers in each Trial Week class
trial_week_total = (
    agg_trial_fr
    .group_by('Trial Week')
    .agg(pl.col('Count').sum().alias('Total Triers'))
    .sort('Trial Week')
    .fill_null(0)
)

# Dummy dataset of all combinations of Trial Week and FR Delta within the
# period of analysis - handles missing combinations
trial_week_range, fr_delta_range = np.meshgrid(np.arange(1, 53, dtype='int16'), np.arange(0, 52, dtype='int16'))
trial_fr_range = pl.LazyFrame({'Trial Week': trial_week_range.reshape(-1), 'FR Delta': fr_delta_range.reshape(-1)})

# Join dummy dataset with the aggregated data
cum_fr_by_trial = (
    trial_fr_range
    .join(trial_week_total, on='Trial Week', how='left')
    .join(agg_trial_fr, on=['Trial Week', 'FR Delta'], how='left')
    .fill_null(0)
    .with_columns(pl.col('Count').cum_sum().over('Trial Week').alias('Cum FR by Week'))  # Cumulative counts of first repeat purchases within each Trial Week class
    .with_columns(
        pl.when(pl.col('Trial Week') > (52 - pl.col('FR Delta')))
        .then(None)  # If fewer weeks remain after the trial than `FR Delta` (e.g., Week 52 triers cannot have an `FR Delta` greater than 0), assign `None`
        .otherwise(
            pl.when(pl.col('Total Triers') > 0)
            .then(pl.col('Cum FR by Week') / pl.col('Total Triers'))  # Cumulative percentage of first repeats
            .otherwise(0)
        )
        .alias('Cum FR by Trial')
    )
    .with_columns(
        (pl.col('Cum FR by Trial') * pl.col('Total Triers')).alias('Weights')  # Compute weights
    )
)
Code
time_to_fr_filtered = (
    cum_fr_by_trial
    .filter(
        (pl.col('Trial Week') <= 26) & (pl.col('FR Delta') <= 26)
    )
    .group_by('FR Delta')
    .agg(
        pl.col('Total Triers').sum(),
        pl.col('Weights').sum()
    )
    .with_columns((pl.col('Weights') / pl.col('Total Triers')).alias('Time to FR'))  # Weighted average of cumulative first repeat percentages across trial weeks
    .sort('FR Delta')
    .collect()
)
time_to_fr_filtered
shape: (27, 4)

FR Delta  Total Triers  Weights  Time to FR
i16       u32           f64      f64
0         101           4.0      0.039604
1         101           10.0     0.09901
2         101           16.0     0.158416
3         101           18.0     0.178218
4         101           21.0     0.207921
…         …             …        …
22        101           40.0     0.39604
23        101           40.0     0.39604
24        101           41.0     0.405941
25        101           41.0     0.405941
26        101           42.0     0.415842
Code
alt.Chart(time_to_fr_filtered).mark_line().encode(
    x=alt.X('FR Delta:O', title='Weeks After Trial',
            axis=alt.Axis(labelAngle=0, values=np.arange(0, 53, 2), labelExpr="datum.value")),
    y=alt.Y('Time to FR:Q', title='% of Triers', axis=alt.Axis(format='.0%')),
).properties(
    width=650, height=250,
    title='Time to First Repeat - Empirical distribution of the time from trial to first repeat purchase'
).configure_view(stroke=None)
We see that 42% of triers made a first repeat purchase within 26 weeks of their trial purchase. This is higher than the 37% of triers that had made a repeat purchase by the end of week 52, as reported in the % of triers repeat purchasing plot from earlier. Why is this the case? The earlier plot considered all triers in the first year, whereas the plot in this section only considers triers acquired up to the 26th week. By restricting attention to this smaller base of triers acquired earlier in the year, we give every trier sufficient time (26 weeks) to make their first repeat purchase. Because the earlier plot includes triers right up to the last week of the observation period, those late triers have little or no time in which to make a repeat purchase.
Code
alt.Chart(time_to_2r_filtered).mark_line().encode(
    x=alt.X("2R Delta:O", title='Weeks After First Repeat',
            axis=alt.Axis(labelAngle=0, values=np.arange(0, 53, 2), labelExpr="datum.value")),
    y=alt.Y('Time to 2R:Q', title='% of First Repeaters', axis=alt.Axis(format='.0%')),
).properties(
    width=650, height=250,
    title='Time to Second Repeat - Empirical distribution of the time from first repeat to second repeat purchase'
).configure_view(stroke=None)